home *** CD-ROM | disk | FTP | other *** search
Text File | 1995-12-31 | 142.7 KB | 3,581 lines | [TEXT/R*ch] |
- Received-Date: Thu, 16 Jun 1994 13:35:20 +0200
- From: pottier@clipper.ens.fr (Francois Pottier)
- Subject: csmp-digest-v3-036
- To: csmp-digest@ens.fr
- Date: Thu, 16 Jun 1994 13:35:08 +0200 (MET DST)
- X-Mailer: ELM [version 2.4 PL23]
- Mime-Version: 1.0
- Content-Type: text/plain; charset=ISO-8859-1
- Content-Transfer-Encoding: 8bit
- Errors-To: listman@ens.fr
- Reply-To: pottier@clipper.ens.fr
- X-Sequence: 39
-
- C.S.M.P. Digest Thu, 16 Jun 94 Volume 3 : Issue 36
-
- Today's Topics:
-
- Apple Event example
- Faster Square Root Algorithm
- How to save alpha in PICT files?
- Open Transport & ASLM
- Sending the Mac Screen Image
- What are Universal Headers?
-
-
-
- The Comp.Sys.Mac.Programmer Digest is moderated by Francois Pottier
- (pottier@clipper.ens.fr).
-
- The digest is a collection of article threads from the internet newsgroup
- comp.sys.mac.programmer. It is designed for people who read c.s.m.p. semi-
- regularly and want an archive of the discussions. If you don't know what a
- newsgroup is, you probably don't have access to it. Ask your systems
- administrator(s) for details. If you don't have access to news, you may
- still be able to post messages to the group by using a mail server like
- anon.penet.fi (mail help@anon.penet.fi for more information).
-
- Each issue of the digest contains one or more sets of articles (called
- threads), with each set corresponding to a 'discussion' of a particular
- subject. The articles are not edited; all articles included in this digest
- are in their original posted form (as received by our news server at
- nef.ens.fr). Article threads are not added to the digest until the last
- article added to the thread is at least two weeks old (this is to ensure that
- the thread is dead before adding it to the digest). Article threads that
- consist of only one message are generally not included in the digest.
-
- The digest is officially distributed by two means, by email and ftp.
-
- If you want to receive the digest by mail, send email to listserv@ens.fr
- with no subject and one of the following commands as body:
- help Sends you a summary of commands
- subscribe csmp-digest Your Name Adds you to the mailing list
- signoff csmp-digest Removes you from the list
- Once you have subscribed, you will automatically receive each new
- issue as it is created.
-
- The official ftp info is //ftp.dartmouth.edu/pub/csmp-digest.
- Questions related to the ftp site should be directed to
- scott.silver@dartmouth.edu. Currently no previous volumes of the CSMP
- digest are available there.
-
- Also, the digests are available to WAIS users. To search back issues
- with WAIS, use comp.sys.mac.programmer.src. With Mosaic, use
- http://www.wais.com/wais-dbs/comp.sys.mac.programmer.html.
-
-
- -------------------------------------------------------
-
- >From farago@onyx.ill.fr (Bela Farago)
- Subject: Apple Event example
- Date: Thu, 2 Jun 94 13:06:18 GMT
- Organization: ILL - Grenoble FRANCE
-
- Hi,
-
- please forgive me if it is trivial but I am programming just by hobby (well
- not really but this is not my main job)
- I need to write a rather trivial code resource which sends the equivalent of
- the following AppleScript
-
- tell application anApplication
- save document 1
- end tell
-
- An Application is an application :)
-
- It is hard to digest the volume VI of IM. and I could not find simple
- example on ftp.apple.com. (I did find some but were not very clear to me) I
- am bit lost in descriptors and direct objects etc. Could anybody give me
- pointers to simple examples?
- Bela Farago (farago@ill.fr)
-
- ps: No flames please, you can always just ignore this message.
-
- +++++++++++++++++++++++++++
-
- >From greer@utdallas.edu (Dale M. Greer)
- Date: 2 Jun 1994 16:54:39 GMT
- Organization: The University of Texas at Dallas
-
- Bela Farago (farago@onyx.ill.fr) wrote:
- > Hi,
-
- > please forgive me if it is trivial but I am programming just by hobby (well
- > not really but this is not my main job)
- > I need to write a rather trivial code resource which sends the equivalent of
- > the following AppleScript
-
- > tell application anApplication
- > save document 1
- > end tell
-
- > An Application is an application :)
-
- > It is hard to digest the volume VI of IM. and I could not find simple
- > example on ftp.apple.com. (I did find some but were not very clear to me) I
- > am bit lost in descriptors and direct objects etc. Could anybody give me
- > pointers to simple examples?
- > Bela Farago (farago@ill.fr)
-
- > ps: No flames please, you can always just ignore this message.
-
-
- Here's how I did it.
-
- /**************************************************** MyCreateDocContainer */
- // 07-APR-94 : Dale M. Greer
-
- #pragma segment AEvents
- void MyCreateDocContainer(AEDesc *myDocContainer, char *docName)
- {
- AEDesc myDocDescRec, nullDescRec;
- OSErr err;
-
- // Prepare to make something from nothing
- err = AECreateDesc(typeNull, nil, 0, &nullDescRec);
-
- // This document container just contains the document name
- err = AECreateDesc(typeChar, docName, strlen(docName), &myDocDescRec);
-
- // Encapsulate the object
- err = CreateObjSpecifier((DescType)cDocument, &nullDescRec,
- (DescType)formName, &myDocDescRec, true, myDocContainer);
-
- AEDisposeDesc(&nullDescRec);
- }
-
- /********************************************************** SaveDocumentAs */
- // 07-APR-94 : Dale M. Greer
-
- #pragma segment AEvents
- void SaveDocumentAs(AEAddressDesc *theAddress, char *docName, char *saveName)
- {
- OSErr err;
- AppleEvent appleEvent, reply;
- AEDesc myDocDescRec, theName;
-
- err = AECreateAppleEvent(kAECoreSuite, kAESave, theAddress,
- kAutoGenerateReturnID, 1L, &appleEvent);
-
- // Make the doc container
- MyCreateDocContainer(&myDocDescRec, docName);
-
- // Append it to the appleEvent
- AEPutParamDesc(&appleEvent, keyDirectObject, &myDocDescRec);
-
- // Append the new filename descriptor to the appleEvent
- AECreateDesc(typeChar, saveName, strlen(saveName), &theName);
- AEPutParamDesc(&appleEvent, keyAEFile, &theName);
-
- // Send it to the Application
- err = AESend(&appleEvent, &reply, kAEWaitReply + kAENeverInteract,
- kAENormalPriority, kAEDefaultTimeout,
- (IdleProcPtr) MyIdleFunction, nil);
-
- AEDisposeDesc(&myDocDescRec);
- AEDisposeDesc(&appleEvent);
- AEDisposeDesc(&reply);
- }
-
-
- You get the address of the target application when you launch it, or if it
- is already launched, use the following routine, where "creator" is the
- creator type of your application. For example, this creator type is
- 'XCEL' for Excel, 'ttxt' for TeachText, etc.
-
- // This function originated from James "im" Beninghaus of Apple DTS
- /*********************************************************** AppLaunched */
- // Find the process serial number of your application
-
- #pragma segment Launch
- Boolean AppLaunched(AEAddressDesc *theAddress, OSType creator,
- ProcessSerialNumber *PSN)
- {
- OSErr osErr;
- ProcessSerialNumber process;
- ProcessInfoRec procInfo;
- Str255 procName;
- FSSpec procSpec;
-
- osErr = noErr;
- process.highLongOfPSN = 0;
- process.lowLongOfPSN = 0;
-
- procInfo.processInfoLength = sizeof(procInfo);
- procInfo.processName = procName;
- procInfo.processAppSpec = &procSpec;
-
- while (procNotFound != (osErr = GetNextProcess(&process))) {
- if (noErr == (osErr = GetProcessInformation(&process, &procInfo))) {
- if (procInfo.processSignature == creator) {
- AECreateDesc(typeProcessSerialNumber, (Ptr)&process,
- sizeof(ProcessSerialNumber), theAddress);
- *PSN = process;
- return(true);
- }
- }
- }
- return(false);
- }
-
- In summary, the program snippet accomplishing what you desire would
- look like this.
-
- if (AppLaunched(&theAddress, theCreator, &PSN) {
- SaveDocumentAs(&theAddress, &docName, &saveName);
- }
-
- If the application is not already launched, there are a number of ways
- to launch it, but I will leave that as an exercise for the reader, or
- for a later tutorial if you're still having problems.
-
- --
-
- Dale Greer, greer@utdallas.edu
- "They know that it is human nature to take up causes whereby a man may
- oppress his neighbor, no matter how unjustly. ... Hence they have had
- no trouble in finding men who would preach the damnability and heresy
- of the new doctrine from the very pulpit..." - Galileo Galilei, 1615
-
-
- ---------------------------
-
- >From busfield@hurricane.seas.ucla.edu (John D. Busfield)
- Subject: Faster Square Root Algorithm
- Date: Tue, 3 May 1994 17:23:52 GMT
- Organization: School of Engineering & Applied Science, UCLA.
-
- Does anyone have an algorithm or code for computing square roots.
- The emphasis is on speed rather than accuracy in that I only need the
- result rounded to the nearest integer (ie sqrt(1000) = 31 or 32).
- Thanks in advance.
-
- John Busfield
- busfield@hurricane.seas.ucla.edu
-
-
- +++++++++++++++++++++++++++
-
- >From rmah@panix.com (Robert S. Mah)
- Date: Tue, 03 May 1994 17:41:55 -0500
- Organization: One Step Beyond
-
- busfield@hurricane.seas.ucla.edu (John D. Busfield) wrote:
-
- > Does anyone have an algorithm or code for computing square roots.
- > The emphasis is on speed rather than accuracy in that I only need the
- > result rounded to the nearest integer (ie sqrt(1000) = 31 or 32).
-
- There is an article on generating very fast square root aproximations
- in the book "Graphics Gems", Ed. Glassner.
-
- It generates a lookup table which is indexed by munging the exponent
- of the argument and then uses the mantissa to do an (linear?)
- aproximation to the final result. It's fairly accurate to within a few
- decimal points. I think the source code is also available somewhere
- on the net, but I'm afraid I don't know where.
-
- Another option, since you only want integer aproximations, is to create a
- table of squares and do a binary search over it. A 1000 element table
- will give you a range of 1->1,000,000 with an average of log2(1000) = 10
- lookups using a simple binary search. A table with 32K or so elements will
-
- have a range of 1->1,000,000,000 and will be searchable in log2(32K) = 30
- lookups. If you can spare a few dozen K, then this may be fast enough.
-
- Cheers,
- Rob
- ___________________________________________________________________________
- Robert S. Mah -=- One Step Beyond -=- 212-947-6507 -=- rmah@panix.com
-
- +++++++++++++++++++++++++++
-
- >From gewekean@studentg.msu.edu (Andrew Geweke)
- Date: Tue, 3 May 1994 20:56:35 -0400
- Organization: Michigan State University
-
- In article <rmah-030594174155@rmah.dialup.access.net>, rmah@panix.com (Robert
- S. Mah) writes:
- > Another option, since you only want integer aproximations, is to create
- > a table of squares and do a binary search over it. A 1000 element
- > table will give you a range of 1->1,000,000 with an average of log2
- > (1000) = 10 lookups using a simple binary search. A table with 32K or
- > so elements will
- >
- > have a range of 1->1,000,000,000 and will be searchable in log2(32K) = 30
- > lookups. If you can spare a few dozen K, then this may be fast enough.
-
- Actually, I once found this algorithm:
-
- int intSqrt (int num) {
- for (i = j = 1, k = 3; num > j; j += (k+=2), ++i)
- ;
-
- return i;
- }
-
- Note that this algorithm rounds up, not down. I'm not sure exactly how
- correct this is (I just coded it off the top of my head) but you get the
- basic idea in any case. Adding the odd integers gives you the squares; i.e. 1
- = 1; 1 + 3 = 4; 1 + 3 + 5 = 9; 1 + 3 + 5 + 7 = 16; 1 + 3 + 5 + 7 + 9 = 25;
- and so on.
-
- I don't know how fast this is in your specific instance, but it should be
- quite fast.
-
-
-
-
-
- +++++++++++++++++++++++++++
-
- >From rmah@panix.com (Robert S. Mah)
- Date: Tue, 03 May 1994 22:04:22 -0500
- Organization: One Step Beyond
-
- gewekean@studentg.msu.edu (Andrew Geweke) wrote:
-
- > rmah@panix.com (Robert S. Mah) writes:
- > > Another option, since you only want integer aproximations, is to create
- > > a table of squares and do a binary search over it. A 1000 element
- > > [...]
- > > have a range of 1->1,000,000,000 and will be searchable in log2(32K) = 30
- > > lookups. If you can spare a few dozen K, then this may be fast enough.
- >
- > Actually, I once found this algorithm:
- >
- > int intSqrt (int num) {
- > for (i = j = 1, k = 3; num > j; j += (k+=2), ++i)
- > ;
- >
- > return i;
- > }
- >
- > Note that this algorithm rounds up, not down. I'm not sure exactly how
- > correct this is (I just coded it off the top of my head) but you get the
- > basic idea in any case. Adding the odd integers gives you the squares;
- > i.e. 1 = 1; 1 + 3 = 4; 1 + 3 + 5 = 9; 1 + 3 + 5 + 7 = 16; 1 + 3 + 5 +
- > 7 + 9 = 25; and so on.
- >
- > I don't know how fast this is in your specific instance, but it should
- > be quite fast.
-
- That's a great little algorithm! That sequence looks _so_ familiar, but
- for the life of me, I can't recall the name...
-
- Anyway, it looks like it would be order O(sqrt(n)) where n is the argument
- to the function. Only a problem if the domain range for the application
- is large, that is Sqrt(10000) would take 10 times as long as Sqrt(100).
- However, unlike the table method, the upper bounds isn't fixed at compile
- time.
-
- Cheers,
- Rob
- ___________________________________________________________________________
- Robert S. Mah -=- One Step Beyond -=- 212-947-6507 -=- rmah@panix.com
-
- +++++++++++++++++++++++++++
-
- >From pottier@corvette.ens.fr (Francois Pottier)
- Date: 4 May 1994 09:42:29 GMT
- Organization: Ecole Normale Superieure, PARIS, France
-
- In article <busfield.767985832@hurricane>,
- John D. Busfield <busfield@hurricane.seas.ucla.edu> wrote:
-
- > Does anyone have an algorithm or code for computing square roots.
-
-
- There was a thread on this subject about a year ago. Here's what
- I kept:
-
- - -----------------------------------------------------------------------
-
-
- >Does somebody have code or an algorithm for extracting a long integer
- >square root and returning an integer?
-
- There was a long series of letters in Dr. Dobbs' Journal a few years
- back that I was a part of. Here are two 'competing' subroutines for the
- 68000 that I wrote. One is a Newton-Raphson iterator, the other a
- hybrid of three different subroutines using two different methods,
- heavily optimized for speed. The non-Newton one is faster in some cases
- and slower in others. Choosing one as 'best' depends on what kind of
- input ranges you expect to see most often. Newton's time doesn't vary
- much on the average based on the input argument. The non-Newton one
- ranges from a lot better (small arguments) to a little worse (large
- arguments), but is always better than Newton's worst case. Don't
- neglect the fact that on a FPU-equipped machine it may be faster to use
- floating point. I've done no research into this possibility, nor have I
- timed this stuff on any 68020+ system. The speed balance is no doubt
- different then. (For that matter, the 68000 testbed had no wait states
- nor interrupt system overhead, which doesn't necessarily apply to a
- 68000 PowerBook, and certainly doesn't apply to a Mac+ or earlier.)
-
- I'm personally fond of the non-Newton version, because the algorithm
- only uses shifts and adds, so it could be implemented in microcode with
- about same speed as a divide.
-
- *************************************************************************
- * *
- * Integer Square Root (32 to 16 bit). *
- * *
- * (Newton-Raphson method). *
- * *
- * Call with: *
- * D0.L = Unsigned number. *
- * *
- * Returns: *
- * D0.L = SQRT(D0.L) *
- * *
- * Notes: Result fits in D0.W, but is valid in longword. *
- * Takes from 338 cycles (1 shift, 1 division) to *
- * 1580 cycles (16 shifts, 4 divisions) (including rts). *
- * Averages 854 cycles measured over first 65535 roots. *
- * Averages 992 cycles measured over first 500000 roots. *
- * *
- *************************************************************************
-
- .globl lsqrt
- * Cycles
- lsqrt movem.l d1-d2,-(sp) (24)
- move.l d0,d1 (4) ; Set up for guessing algorithm.
- beq.s return (10/8) ; Don't process zero.
- moveq #1,d2 (4)
-
- guess cmp.l d2,d1 (6) ; Get a guess that is guaranteed to be
- bls.s newton (10/8) ; too high, but not by much, by dividing the
- add.l d2,d2 (8) ; argument by two and multiplying a 1 by 2
- lsr.l #1,d1 (10) ; until the power of two passes the modified
- bra.s guess (10) ; argument, then average these two numbers.
-
- newton add.l d1,d2 (8) ; Average the two guesses.
- lsr.l #1,d2 (10)
- move.l d0,d1 (4) ; Generate the next approximation(s)
- divu d2,d1 (140) ; via the Newton-Raphson method.
- bvs.s done (10/8) ; Handle out-of-range input (cheats!)
- cmp.w d1,d2 (4) ; Have we converged?
- bls.s done (10/8)
- swap d1 (4) ; No, kill the remainder so the
- clr.w d1 (4) ; next average comes out right.
- swap d1 (4)
- bra.s newton (10)
-
- done clr.w d0 (4) ; Return a word answer in a longword.
- swap d0 (4)
- move.w d2,d0 (4)
- return movem.l (sp)+,d1-d2 (28)
- rts (16)
-
- end
- *************************************************************************
- * *
- * Integer Square Root (32 to 16 bit). *
- * *
- * (Exact method, not approximate). *
- * *
- * Call with: *
- * D0.L = Unsigned number. *
- * *
- * Returns: *
- * D0.L = SQRT(D0.L) *
- * *
- * Notes: Result fits in D0.W, but is valid in longword. *
- * Takes from 122 to 1272 cycles (including rts). *
- * Averages 610 cycles measured over first 65535 roots. *
- * Averages 1104 cycles measured over first 500000 roots. *
- * *
- *************************************************************************
-
- .globl lsqrt
- * Cycles
- lsqrt tst.l d0 (4) ; Skip doing zero.
- beq.s done (10/8)
- cmp.l #$10000,d0 (14) ; If is a longword, use the long routine.
- bhs.s glsqrt (10/8)
- cmp.w #625,d0 (8) ; Would the short word routine be quicker?
- bhi.s gsqrt (10/8) ; No, use general purpose word routine.
- * ; Otherwise fall into special routine.
- *
- * For speed, we use three exit points.
- * This is cheesy, but this is a speed-optimized subroutine!
-
- page
- *************************************************************************
- * *
- * Faster Integer Square Root (16 to 8 bit). For small arguments. *
- * *
- * (Exact method, not approximate). *
- * *
- * Call with: *
- * D0.W = Unsigned number. *
- * *
- * Returns: *
- * D0.W = SQRT(D0.W) *
- * *
- * Notes: Result fits in D0.B, but is valid in word. *
- * Takes from 72 (d0=1) to 504 (d0=625) cycles *
- * (including rts). *
- * *
- * Algorithm supplied by Motorola. *
- * *
- *************************************************************************
-
- * Use the theorem that a perfect square is the sum of the first
- * sqrt(arg) number of odd integers.
-
- * Cycles
- move.w d1,-(sp) (8)
- move.w #-1,d1 (8)
- qsqrt1 addq.w #2,d1 (4)
- sub.w d1,d0 (4)
- bpl qsqrt1 (10/8)
- asr.w #1,d1 (8)
- move.w d1,d0 (4)
- move.w (sp)+,d1 (12)
- done rts (16)
-
- page
- *************************************************************************
- * *
- * Integer Square Root (16 to 8 bit). *
- * *
- * (Exact method, not approximate). *
- * *
- * Call with: *
- * D0.W = Unsigned number. *
- * *
- * Returns: *
- * D0.L = SQRT(D0.W) *
- * *
- * Uses: D1-D4 as temporaries -- *
- * D1 = Error term; *
- * D2 = Running estimate; *
- * D3 = High bracket; *
- * D4 = Loop counter. *
- * *
- * Notes: Result fits in D0.B, but is valid in word. *
- * *
- * Takes from 544 to 624 cycles (including rts). *
- * *
- * Instruction times for branch-type instructions *
- * listed as (X/Y) are for (taken/not taken). *
- * *
- *************************************************************************
-
- * Cycles
- gsqrt movem.l d1-d4,-(sp) (40)
- move.w #7,d4 (8) ; Loop count (bits-1 of result).
- clr.w d1 (4) ; Error term in D1.
- clr.w d2 (4)
- sqrt1 add.w d0,d0 (4) ; Get 2 leading bits a time and add
- addx.w d1,d1 (4) ; into Error term for interpolation.
- add.w d0,d0 (4) ; (Classical method, easy in binary).
- addx.w d1,d1 (4)
- add.w d2,d2 (4) ; Running estimate * 2.
- move.w d2,d3 (4)
- add.w d3,d3 (4)
- cmp.w d3,d1 (4)
- bls.s sqrt2 (10/8) ; New Error term > 2* Running estimate?
- addq.w #1,d2 (4) ; Yes, we want a '1' bit then.
- addq.w #1,d3 (4) ; Fix up new Error term.
- sub.w d3,d1 (4)
- sqrt2 dbra d4,sqrt1 (10/14) ; Do all 8 bit-pairs.
- move.w d2,d0 (4)
- movem.l (sp)+,d1-d4 (44)
- rts (16)
-
- page
- *************************************************************************
- * *
- * Integer Square Root (32 to 16 bit). *
- * *
- * (Exact method, not approximate). *
- * *
- * Call with: *
- * D0.L = Unsigned number. *
- * *
- * Returns: *
- * D0.L = SQRT(D0.L) *
- * *
- * Uses: D1-D4 as temporaries -- *
- * D1 = Error term; *
- * D2 = Running estimate; *
- * D3 = High bracket; *
- * D4 = Loop counter. *
- * *
- * Notes: Result fits in D0.W, but is valid in longword. *
- * *
- * Takes from 1080 to 1236 cycles (including rts.) *
- * *
- * Two of the 16 passes are unrolled from the loop so that *
- * quicker instructions may be used where there is no *
- * danger of overflow (in the early passes). *
- * *
- * Instruction times for branch-type instructions *
- * listed as (X/Y) are for (taken/not taken). *
- * *
- *************************************************************************
-
- * Cycles
- glsqrt movem.l d1-d4,-(sp) (40)
- moveq #13,d4 (4) ; Loop count (bits-1 of result).
- moveq #0,d1 (4) ; Error term in D1.
- moveq #0,d2 (4)
- lsqrt1 add.l d0,d0 (8) ; Get 2 leading bits a time and add
- addx.w d1,d1 (4) ; into Error term for interpolation.
- add.l d0,d0 (8) ; (Classical method, easy in binary).
- addx.w d1,d1 (4)
- add.w d2,d2 (4) ; Running estimate * 2.
- move.w d2,d3 (4)
- add.w d3,d3 (4)
- cmp.w d3,d1 (4)
- bls.s lsqrt2 (10/8) ; New Error term > 2* Running estimate?
- addq.w #1,d2 (4) ; Yes, we want a '1' bit then.
- addq.w #1,d3 (4) ; Fix up new Error term.
- sub.w d3,d1 (4)
- lsqrt2 dbra d4,lsqrt1 (10/14) ; Do first 14 bit-pairs.
-
- add.l d0,d0 (8) ; Do 15-th bit-pair.
- addx.w d1,d1 (4)
- add.l d0,d0 (8)
- addx.l d1,d1 (8)
- add.w d2,d2 (4)
- move.l d2,d3 (4)
- add.w d3,d3 (4)
- cmp.l d3,d1 (6)
- bls.s lsqrt3 (10/8)
- addq.w #1,d2 (4)
- addq.w #1,d3 (4)
- sub.l d3,d1 (8)
-
- lsqrt3 add.l d0,d0 (8) ; Do 16-th bit-pair.
- addx.l d1,d1 (8)
- add.l d0,d0 (8)
- addx.l d1,d1 (8)
- add.w d2,d2 (4)
- move.l d2,d3 (4)
- add.l d3,d3 (8)
- cmp.l d3,d1 (6)
- bls.s lsqrt4 (10/8)
- addq.w #1,d2 (4)
- lsqrt4 move.w d2,d0 (4)
- movem.l (sp)+,d1-d4 (44)
- rts (16)
-
- end
- --
- +----------------+
- ! II CCCCCC ! Jim Cathey
- ! II SSSSCC ! ISC-Bunker Ramo
- ! II CC ! TAF-C8; Spokane, WA 99220
- ! IISSSS CC ! UUCP: uunet!isc-br!jimc (jimc@isc-br.isc-br.com)
- ! II CCCCCC ! (509) 927-5757
- +----------------+
-
- - -----------------------------------------------------------------------------
-
- /*
- * ISqrt--
- * Find square root of n. This is a Newton's method approximation,
- * converging in 1 iteration per digit or so. Finds floor(sqrt(n) + 0.5).
- */
- int ISqrt(register int n)
- {
- register int i;
- register long k0, k1, nn;
-
- for (nn = i = n, k0 = 2; i > 0; i >>= 2, k0 <<= 1)
- ;
- nn <<= 2;
- for (;;) {
- k1 = (nn / k0 + k0) >> 1;
- if (((k0 ^ k1) & ~1) == 0)
- break;
- k0 = k1;
- }
- return (int) ((k1 + 1) >> 1);
- }
-
- --
- Joseph Nathan Hall | Joseph's Rule of Thumb: Most folks aren't interested
- Software Architect | in your rules of thumb.
- Gorca Systems Inc. | ----- joseph@joebloe.maple-shade.nj.us (home) -----
- (on assignment) | (602) 732-2549 jnhall@sat.mot.com (work)
-
- - -----------------------------------------------------------------------------
-
- Here's an article I saved from c.s.m.p four months ago. Strangely, it
- was only distributed in North America. I don't know how fast it works
- in practice, but there are no multiplications or divisions in its inner
- loop, which is promising.
-
-
- #include <math.h>
-
- /*
- * It requires more space to describe this implementation of the manual
- * square root algorithm than it did to code it. The basic idea is that
- * the square root is computed one bit at a time from the high end. Because
- * the original number is 32 bits (unsigned), the root cannot exceed 16 bits
- * in length, so we start with the 0x8000 bit.
- *
- * Let "x" be the value whose root we desire, "t" be the square root
- * that we desire, and "s" be a bitmask. A simple way to compute
- * the root is to set "s" to 0x8000, and loop doing the following:
- *
- * t = 0;
- * s = 0x8000;
- * do {
- * if ((t + s) * (t + s) <= x)
- * t += s;
- * s >>= 1;
- * while (s != 0);
- *
- * The primary disadvantage to this approach is the multiplication. To
- * eliminate this, we begin simplying. First, we observe that
- *
- * (t + s) * (t + s) == (t * t) + (2 * t * s) + (s * s)
- *
- * Therefore, if we redefine "x" to be the original argument minus the
- * current value of (t * t), we can determine if we should add "s" to
- * the root if
- *
- * (2 * t * s) + (s * s) <= x
- *
- * If we define a new temporary "nr", we can express this as
- *
- * t = 0;
- * s = 0x8000;
- * do {
- * nr = (2 * t * s) + (s * s);
- * if (nr <= x) {
- * x -= nr;
- * t += s;
- * }
- * s >>= 1;
- * while (s != 0);
- *
- * We can improve the performance of this by noting that "s" is always a
- * power of two, so multiplication by "s" is just a shift. Also, because
- * "s" changes in a predictable manner (shifted right after each iteration)
- * we can precompute (0x8000 * t) and (0x8000 * 0x8000) and then adjust
- * them by shifting after each step. First, we let "m" hold the value
- * (s * s) and adjust it after each step by shifting right twice. We
- * also introduce "r" to hold (2 * t * s) and adjust it after each step
- * by shifting right once. When we update "t" we must also update "r",
- * and we do so by noting that (2 * (old_t + s) * s) is the same as
- * (2 * old_t * s) + (2 * s * s). Noting that (s * s) is "m" and that
- * (r + 2 * m) == ((r + m) + m) == (nr + m):
- *
- * t = 0;
- * s = 0x8000;
- * m = 0x40000000;
- * r = 0;
- * do {
- * nr = r + m;
- * if (nr <= x) {
- * x -= nr;
- * t += s;
- * r = nr + m;
- * }
- * s >>= 1;
- * r >>= 1;
- * m >>= 2;
- * } while (s != 0);
- *
- * Finally, we note that, if we were using fractional arithmetic, after
- * 16 iterations "s" would be a binary 0.5, so the value of "r" when
- * the loop terminates is (2 * t * 0.5) or "t". Because the values in
- * "t" and "r" are identical after the loop terminates, and because we
- * do not otherwise use "t" explicitly within the loop, we can omit it.
- * When we do so, there is no need for "s" except to terminate the loop,
- * but we observe that "m" will become zero at the same time as "s",
- * so we can use it instead.
- *
- * The result we have at this point is the floor of the square root. If
- * we want to round to the nearest integer, we need to consider whether
- * the remainder in "x" is greater than or equal to the difference
- * between ((r + 0.5) * (r + 0.5)) and (r * r). Noting that the former
- * quantity is (r * r + r + 0.25), we want to check if the remainder is
- * greater than or equal to (r + 0.25). Because we are dealing with
- * integers, we can't have equality, so we round up if "x" is strictly
- * greater than "r":
- *
- * if (x > r)
- * r++;
- */
-
- unsigned int isqrt(unsigned long x)
- {
- unsigned long r, nr, m;
-
- r = 0;
- m = 0x40000000;
- do {
- nr = r + m;
- if (nr <= x) {
- x -= nr;
- r = nr + m;
- }
- r >>= 1;
- m >>= 2;
- } while (m != 0);
-
- if (x > r)
- r++;
- return(r);
- }
- --
- (Dr.) John Bruner, Deputy Director bruner@csrd.uiuc.edu
- Center for Supercomputing Research & Development (217) 244-4476 (voice)
- University of Illinois at Urbana-Champaign (217) 244-1351 (FAX)
- 465 CSRL, MC-264; 1308 West Main St.; Urbana, IL 61801-2307
-
-
- --
- Jamie McCarthy Internet: k044477@kzoo.edu AppleLink: j.mccarthy
-
- - -----------------------------------------------------------------------------
-
- Then there is the easy way - use the toolbox:
-
- #define LongSqrt(x) ((short)(FracSqrt((Fract)(x)) >> 15))
-
- FractSqrt is in FixMath.h. This works because given two fixed point number
- of the form x.y, where x is the number of bits before the decimal point
- and y is the number of bits after the decimal point, twos complement
- multiplication will yield a result of: x1.y1 * x2.y2 = x1+x2.y1+y2. If the
- numbers are the same format, this becomes x.y * x.y = 2x.2y. This holds
- for squaring a number also: x.y^2 = 2x.2y. The sqrt is then sqrt(x.y) =
- x/2.y/2. FracSqrt is of the form FracSqrt(2.30) = 2.30. Since we know how
- sqrt works this can also be expressed as FracSqrt(x.y) = x/2.y/2 << 15
- (This is a "virtual" shift since the calculation is compleated with more
- precision the 15 bits shifted in are "real" and not 0). So if you want to
- use FracSqrt for a long you get FracSqrt(32.0) = 16.0 << 15. So you shift
- the result down be 15 to get a short (you can add 1 << 14 prior to
- shifting down by 15 to round instead of truncing your result). If you
- wanted a long sqrt with a 16.16 (Fixed) result you would use:
-
- #define LongFixedSqrt(x) ((Fixed)(FracSqrt((Fract)(x)) << 1))
-
- If you want a Fixed (16.16) sqrt with a Fixed result (16.16) you would use:
-
- #define FixedSqrt(x) ((Fixed)FracSqrt((Fract)(x)) >> 7))
-
- You can do this kind of work with all of the fixed point routines and
- standard operators.
-
- Sean Parent
-
-
-
- --
- Francois Pottier pottier@dmi.ens.fr
- - ----------------------------------------------------------------------------
- This area dedicated to the preservation of endangered species. "Moof!"
-
- +++++++++++++++++++++++++++
-
- >From christer@cs.umu.se (Christer Ericson)
- Date: Wed, 4 May 1994 12:16:40 GMT
- Organization: Dept. of Computing Science, Umea Univ., 901 87 Umea, Sweden
-
- In <busfield.767985832@hurricane> busfield@hurricane.seas.ucla.edu (John D. Busfield) writes:
-
- > Does anyone have an algorithm or code for computing square roots.
- >The emphasis is on speed rather than accuracy in that I only need the
- >result rounded to the nearest integer (ie sqrt(1000) = 31 or 32).
- >Thanks in advance.
-
- I posted this to comp.sys.m68k some time ago, here it is again:
-
-
- unsigned short ce_quick_sqrt(n)
- register unsigned long n;
- {
- asm {
- ;-------------------------
- ;
- ; Routine : ISQRT; integer square root
- ; I/O parameters: d0.w = sqrt(d0.l)
- ; Registers used: d0-d2/a0
- ;
- ; This is a highly optimized integer square root routine, based
- ; on the iterative Newton/Babylonian method
- ;
- ; r(n + 1) = (r(n) + A / R(n)) / 2
- ;
- ; Verified over the full interval [0x0L,0xFFFFFFFFL]
- ;
- ; Some minor compromises have been made to make it perform well
- ; on the 68000 as well as 68020/030/040. This routine outperforms
- ; the best of all other algorithms I've seen (except for a table
- ; lookup).
- ;
- ; Copyright (c) Christer Ericson, 1993. All rights reserved.
- ;
- ; Christer Ericson, Dept. of Computer Science, University of Umea,
- ; S-90187 UMEA, SWEDEN. Internet: christer@cs.umu.se
- ;
-
- ;-------------------------
- ;
- ; Macintosh preamble since THINK C passes first register param
- ; in d7. Remove this section on any other machine
- ;
- move.l d7,d0
- ;-------------------------
- ;
- ; Actual integer square root routine starts here
- ;
- move.l d0,a0 ; save original input value for the iteration
- beq.s @exit ; unfortunately we must special case zero
- moveq #2,d1 ; init square root guess
- ;-------------------------
- ;
- ; Speed up the process of finding a power of two approximation
- ; to the actual square root by shifting an appropriate amount
- ; when the input value is large enough
- ;
- ; If input values are word only, this section can be removed
- ;
- move.l d0,d2
- swap d2 ; do [and.l 0xFFFF0000,d2] this way to...
- tst.w d2 ; go faster on 68000 and to avoid having to...
- beq.s @skip8 ; reload d2 for the next test below
- move.w #0x200,d1 ; faster than lsl.w #8,d1 (68000)
- lsr.l #8,d0
- @skip8 and.w #0xFE00,d2 ; this value and shift by 5 are magic
- beq.s @skip4
- lsl.w #5,d1
- lsr.l #5,d0
- @skip4
-
- ;-------------------------
- ;
- ; This finds the power of two approximation to the actual square root
- ; by doing the integer equivalent to sqrt(x) = 2 ^ (log2(x) / 2). This
- ; is done by shifting the input value down while shifting the root
- ; approximation value up until they meet in the middle. Better precision
- ; (in the step described below) is gained by starting the approximation
- ; at 2 instead of 1 (this means that the approximation will be a power
- ; of two too large so remember to shift it down).
- ;
- ; Finally (and here's the clever thing) since, by previously shifting the
- ; input value down, it has actually been divided by the root approximation
- ; value already so the first iteration is "for free". Not bad, eh?
- ;
- @loop add.l d1,d1
- lsr.l #1,d0
- cmp.l d0,d1
- bcs.s @loop
- @skip lsr.l #1,d1 ; adjust the approximation
- add.l d0,d1 ; here we just add and shift to...
- lsr.l #1,d1 ; get the first iteration "for free"!
- ;-------------------------
- ;
- ; The iterative root value is guaranteed to be larger than (or equal to)
- ; the actual square root, except for the initial guess. But since the first
- ; iteration was done above, the loop test can be simplified below
- ; (Oh, without the bvs.s the routine will fail on most large numbers, like
- ; for instance, 0xFFFF0000. Could there be a better way of handling these,
- ; so the bvs.s can be removed? Nah...)
- ;
- @loop2 move.l a0,d2 ; get original input value
- move.w d1,d0 ; save current guess
- divu.w d1,d2 ; do the Newton method thing
- bvs.s @exit ; if div overflows, exit with current guess
- add.w d2,d1
- roxr.w #1,d1 ; roxr ensures shifting back carry overflow
- cmp.w d0,d1
- bcs.s @loop2 ; exit with result in d0.w
- @exit
- }
- }
-
-
- Christer Ericson --- Internet: christer@cs.umu.se --- tel: +46-90-166794
- Department of Computer Science, University of Umea, S-90187 UMEA, SWEDEN
-
- +++++++++++++++++++++++++++
-
- >From usenet@lowry.eche.ualberta.ca (Brian Lowry)
- Date: 4 May 1994 16:38:23 GMT
- Organization: Dept. of Chemical Eng., University of Alberta
-
- Well, if people are going to start posting square root routines that use
- division, here's about the simplest possible, coded in C:
-
-
- #define precLim 1e-8 //for good single precision performance
-
- float sqrt(float x);
-
- float sqrt(float x)
- { register float d = 0.5*(1.0 - x);
- register float b = 1.0 - d;
-
- while ((d > precLim)||(d < precLim))
- { d *= 0.5*d/b; b -= d;}
-
- return b;
- }
-
-
- This takes about three to five iterations, but on the PowerPC that
- amounts to a whopping 100 to 170 clock cycles. Since that's enough cycles
- for about 80 multiplies and/or flops, division is definitely not the best
- route here (though it's certainly compact...)
-
- Brian Lowry
-
- +++++++++++++++++++++++++++
-
- >From parkb@bigbang.Stanford.EDU (Brian Park)
- Date: 4 May 1994 20:43:37 GMT
- Organization: Stanford University
-
- I am assuming that the original poster wanted integer in/ integer out
- since he only wanted accurary within 1.
-
- The following is a description of code which follows at the end of this
- article.
-
- lsqrt2() - traditional Newton's method
- - ------
- The traditional Newton's method of getting y = sqrt(n), coded as lsqrt2()
- below)
-
- n = y^2
- y(0) = n
- y(i+1) = (y(i) + n/y(i))/2
-
- is second-order convergent, meaning that the relative error in the i+1
- iteration is proportional to the square of the previous error
- (remember, at sufficiently large iteration, relative error << 1). [Do
- a Taylor series exansion to verify that the first order term
- vanishes]. Hence, for a fixed tolerance, in this case, within unity
- (ie. error of 1/n), the algorithm performs a bisection on the
- error at every iteration.
-
- My point is that Newton's method is O(log_2(n)) and you are not going
- to do much better than that. Binary lookup tables are O(log_2(n))
- efficient. The only thing you might save is an integer division
- compared to the Newton's method (see n/y(i) above). But it also globbles
- up memory and has fixed size. A hash table provides the fastest
- lookup table, but it takes up even more memory than the binary lookup
- table, and coding it can be more tricky. Do you really want to use up
- that much memory for a simple square root function?
-
- I should add that Newton's method is guaranteed to converge for all n.
-
- lsqrt1() - bit-shift method
- - -----
- There is yet another method (see lsqrt1() function below) that I dreamed
- up last night, and coded it quickly this morning. It uses bit-shifts
- to perform an integer square root by taking log_2(n), dividing it
- by 2 (hence doing a square root) and shifting it back. This routine is
- also O(log_2(n)) but has the peculiar property that it is _faster_
- for larger numbers. You can see why below. It has the disadvantage that
- it is quite inaccurate for small numbers, but surprisingly accurate for
- large numbers.
-
- lsqrt3() - combination
- - ------
- If you really need accurary within unity, then you can plug the result
- of the bit-shift method, lsqrt1(), use it as the initial guess for
- the Newton's method and converge the answer down to its limiting
- accurary. This combination is really really fast, since Newton's
- method converges within an iteration or two when the initial guess is
- very close to the actual answer. This has been coded as lsqrt3()
- below.
-
- For comparision, the straight Newton's method takes longer
- because its initial guess is simply the original answer, n.
- But let me empasize that even with straight Newton's method is O(log_2(n)),
- Sqrt(10^9) takes only 19 iterations, as you can verify. That's only 19
- integer divisions. How fast do you want your square root to be?
-
- I have even more ideas on how to speed things up, but I think the code
- below should get you going. I'm sorry it's not coded in Macintosh,
- I'm on the Unix box right now, and my mac's at home. It hasn't
- been tested extensively on the boundary points (ie. 0,1, 2^32-1),
- but you can do that.
-
- My final comment concerns the O(sqrt(n)) algorithm that was posted
- a short while ago. Although it is simple and cute, I would not recommend
- it when you have so many O(log_2(n)) routines to choose from.
-
- Regards,
- Brian Park
-
- - - cut for code below ---
-
- #include <stdio.h>
-
- #define BITSIZE (sizeof(unsigned long)*8)
-
- unsigned long lsqrtng();
- unsigned long lsqrt1();
- unsigned long lsqrt2();
- unsigned long lsqrt3();
-
- /***
- Usage:
- lsqrt n
-
- Example:
-
- lsqrt 100000
- l = 1000000
- bit shift: 976
- iter = 13
- newton: 1000
- iter = 2
- bit shift + newton: 1000
- ***/
- main(int argc, char **argv)
- {
- unsigned long l;
-
- l = atol(argv[1]);
- printf("l = %lu\n", l);
-
- printf("bit shift: %lu\n", lsqrt1(l));
- printf("newton: %lu\n", lsqrt2(l));
- printf("bit shift + newton: %lu\n", lsqrt3(l));
- }
-
- /***
- Take square root by performing bit-shifts, getting the log_2(a),
- dividing by 2, and shifting back.
- ***/
- unsigned long lsqrt1(unsigned long a)
- {
- register unsigned int i;
- register unsigned long l = a;
-
- if (a==0) return 0;
-
- for (i=0; l<(unsigned long)(1L<<(BITSIZE-1)) && i<BITSIZE; ++i)
- l<<=1;
-
- i = (32-i)>>1;
- a >>= i;
- return a;
- }
-
- /***
- Take square root by using Newton's method with an initial guess of itself.
- ***/
- unsigned long lsqrt2(unsigned long a)
- {
- return lsqrtng(a, a);
- }
-
- /***
- Newton's method with guess.
- Newton's method is second order convergence, hence it should be very fast.
- ***/
- unsigned long lsqrtng(register unsigned long a, unsigned long guess)
- {
- register unsigned long b, b0;
- int i = 0;
-
- if (a == 0) return 0;
-
- b = guess;
- b0 = a/b;
- while (labs(b-b0)>1) {
- /* printf("guess = %lu\n", b);*/
- ++i;
- b0 = b;
- b=(a/b+b)>>1;
- }
- printf("iter = %d\n", i);
- return b;
- }
-
- /***
- Combination Newton's method and bit-shifts.
- ***/
- unsigned long lsqrt3(unsigned long a)
- {
- return lsqrtng(a, lsqrt1(a));
- }
-
- +++++++++++++++++++++++++++
-
- >From sparent@mv.us.adobe.com (Sean Parent)
- Date: Wed, 4 May 1994 20:46:30 GMT
- Organization: Adobe Systems Incorporated
-
- In article <9405032056.AA35937@geweke.ppp.msu.edu>,
- gewekean@studentg.msu.edu (Andrew Geweke) wrote:
-
- > Actually, I once found this algorithm:
- >
- > int intSqrt (int num) {
- > for (i = j = 1, k = 3; num > j; j += (k+=2), ++i)
- > ;
- >
- > return i;
- > }
-
- This is a difference algorithim and similar algorithms can be used to solve
- most any polynomial expression. This is the method used by Charles Babage
- (sp?) in the late 1800's in his difference engine. For example: y = x^3 can
- be calculated over a range with a given step in x of n by:
-
- y = y` + a
-
- Where a = (x + n)^3 - x^3. Or a = 3nx^2 + 3xn^2 + n^3.
-
- The x^2 term (i) can be similary reduce:
- i = i` + b
- Where b = (x + n)^2 - x^2. Or b = 2xn + n^2.
-
- Finaly the x term is reduced as:
- x = x` + n
-
- This can be coded up as a loop consisting of only additions. As to the
- original question. To calculate a square root of an integer on a Mac, I
- would code it as:
-
- inline long LongSqrt(long x)
- {
- return FracSqrt(x) >> 15;
- }
-
- FracSqrt is in FixMath.h. This works because squaring a fixed point number
- doubles the number of fractional bits (so a number of type Fract, which is
- 2d30 multiplied by a Fract would yield a 4d60 number). Taking the square
- root of a number then halves the number of bits (to the original
- precision). So a strait square root of a fract would yield a 1d15 value.
- Apple's FracSqrt return a Fract so in effect it is calculating sqrt(x) <<
- 15 [with the bits shifted in actually being calculated]. So if you want a
- number of type Fixed as a result for the sqrt of an integer you could use.
-
- inline Fixed FixedSqrtLong(long x)
- {
- return FracSqrt(x) << 1; // 0/2 + 15 + 1 = 16
- }
-
- Or a the sqrt of a Fixed yielding a fixed is:
-
- inline Fixed FixedSqrt(Fixed x)
- {
- return FracSqrt(x) >> 7; // 16/2 + 15 - 7 = 16
- }
-
- This is a difficult thing to explain, I don't no of any good terminology
- for talking about fixed point numbers. Play with it some and it will make
- sense.
-
- --
- Sean Parent
-
- +++++++++++++++++++++++++++
-
- >From jimc@tau-ceti.isc-br.com (Jim Cathey)
- Date: Wed, 4 May 1994 18:04:33 GMT
- Organization: Olivetti North America, Spokane, WA
-
- In article <busfield.767985832@hurricane> busfield@hurricane.seas.ucla.edu (John D. Busfield) writes:
- > Does anyone have an algorithm or code for computing square roots.
- >The emphasis is on speed rather than accuracy in that I only need the
- >result rounded to the nearest integer (ie sqrt(1000) = 31 or 32).
-
- Here's two: A Newton's Method, and a cool bit-banger that's a little
- faster under certain circumstances. (Timings were all on a 68000
- as these are kinda old. Re-test to see what works best now.)
-
- *************************************************************************
- * *
- * Integer Square Root (32 to 16 bit). *
- * *
- * (Newton-Raphson method). *
- * *
- * Call with: *
- * D0.L = Unsigned number. *
- * *
- * Returns: *
- * D0.L = SQRT(D0.L) *
- * *
- * Notes: Result fits in D0.W, but is valid in longword. *
- * Takes from 338 cycles (1 shift, 1 division) to *
- * 1580 cycles (16 shifts, 4 divisions) (including rts). *
- * Averages 854 cycles measured over first 65535 roots. *
- * Averages 992 cycles measured over first 500000 roots. *
- * *
- *************************************************************************
-
- .globl lsqrt
- * Cycles
- lsqrt movem.l d1-d2,-(sp) (24)
- move.l d0,d1 (4) ; Set up for guessing algorithm.
- beq.s return (10/8) ; Don't process zero.
- moveq #1,d2 (4)
-
- guess cmp.l d2,d1 (6) ; Get a guess that is guaranteed to be
- bls.s newton (10/8) ; too high, but not by much, by dividing the
- add.l d2,d2 (8) ; argument by two and multiplying a 1 by 2
- lsr.l #1,d1 (10) ; until the power of two passes the modified
- bra.s guess (10) ; argument, then average these two numbers.
-
- newton add.l d1,d2 (8) ; Average the two guesses.
- lsr.l #1,d2 (10)
- move.l d0,d1 (4) ; Generate the next approximation(s)
- divu d2,d1 (140) ; via the Newton-Raphson method.
- bvs.s done (10/8) ; Handle out-of-range input (cheats!)
- cmp.w d1,d2 (4) ; Have we converged?
- bls.s done (10/8)
- swap d1 (4) ; No, kill the remainder so the
- clr.w d1 (4) ; next average comes out right.
- swap d1 (4)
- bra.s newton (10)
-
- done clr.w d0 (4) ; Return a word answer in a longword.
- swap d0 (4)
- move.w d2,d0 (4)
- return movem.l (sp)+,d1-d2 (28)
- rts (16)
-
- end
- *************************************************************************
- * *
- * Integer Square Root (32 to 16 bit). *
- * *
- * (Exact method, not approximate). *
- * *
- * Call with: *
- * D0.L = Unsigned number. *
- * *
- * Returns: *
- * D0.L = SQRT(D0.L) *
- * *
- * Notes: Result fits in D0.W, but is valid in longword. *
- * Takes from 122 to 1272 cycles (including rts). *
- * Averages 610 cycles measured over first 65535 roots. *
- * Averages 1104 cycles measured over first 500000 roots. *
- * *
- *************************************************************************
-
- .globl lsqrt
- * Cycles
- lsqrt tst.l d0 (4) ; Skip doing zero.
- beq.s done (10/8)
- cmp.l #$10000,d0 (14) ; If is a longword, use the long routine.
- bhs.s glsqrt (10/8)
- cmp.w #625,d0 (8) ; Would the short word routine be quicker?
- bhi.s gsqrt (10/8) ; No, use general purpose word routine.
- * ; Otherwise fall into special routine.
- *
- * For speed, we use three exit points.
- * This is cheesy, but this is a speed-optimized subroutine!
-
- page
- *************************************************************************
- * *
- * Faster Integer Square Root (16 to 8 bit). For small arguments. *
- * *
- * (Exact method, not approximate). *
- * *
- * Call with: *
- * D0.W = Unsigned number. *
- * *
- * Returns: *
- * D0.W = SQRT(D0.W) *
- * *
- * Notes: Result fits in D0.B, but is valid in word. *
- * Takes from 72 (d0=1) to 504 (d0=625) cycles *
- * (including rts). *
- * *
- * Algorithm supplied by Motorola. *
- * *
- *************************************************************************
-
- * Use the theorem that a perfect square is the sum of the first
- * sqrt(arg) number of odd integers.
-
- * Cycles
- move.w d1,-(sp) (8)
- move.w #-1,d1 (8)
- qsqrt1 addq.w #2,d1 (4)
- sub.w d1,d0 (4)
- bpl qsqrt1 (10/8)
- asr.w #1,d1 (8)
- move.w d1,d0 (4)
- move.w (sp)+,d1 (12)
- done rts (16)
-
- page
- *************************************************************************
- * *
- * Integer Square Root (16 to 8 bit). *
- * *
- * (Exact method, not approximate). *
- * *
- * Call with: *
- * D0.W = Unsigned number. *
- * *
- * Returns: *
- * D0.L = SQRT(D0.W) *
- * *
- * Uses: D1-D4 as temporaries -- *
- * D1 = Error term; *
- * D2 = Running estimate; *
- * D3 = High bracket; *
- * D4 = Loop counter. *
- * *
- * Notes: Result fits in D0.B, but is valid in word. *
- * *
- * Takes from 544 to 624 cycles (including rts). *
- * *
- * Instruction times for branch-type instructions *
- * listed as (X/Y) are for (taken/not taken). *
- * *
- *************************************************************************
-
- * Cycles
- gsqrt movem.l d1-d4,-(sp) (40)
- move.w #7,d4 (8) ; Loop count (bits-1 of result).
- clr.w d1 (4) ; Error term in D1.
- clr.w d2 (4)
- sqrt1 add.w d0,d0 (4) ; Get 2 leading bits a time and add
- addx.w d1,d1 (4) ; into Error term for interpolation.
- add.w d0,d0 (4) ; (Classical method, easy in binary).
- addx.w d1,d1 (4)
- add.w d2,d2 (4) ; Running estimate * 2.
- move.w d2,d3 (4)
- add.w d3,d3 (4)
- cmp.w d3,d1 (4)
- bls.s sqrt2 (10/8) ; New Error term > 2* Running estimate?
- addq.w #1,d2 (4) ; Yes, we want a '1' bit then.
- addq.w #1,d3 (4) ; Fix up new Error term.
- sub.w d3,d1 (4)
- sqrt2 dbra d4,sqrt1 (10/14) ; Do all 8 bit-pairs.
- move.w d2,d0 (4)
- movem.l (sp)+,d1-d4 (44)
- rts (16)
-
- page
- *************************************************************************
- * *
- * Integer Square Root (32 to 16 bit). *
- * *
- * (Exact method, not approximate). *
- * *
- * Call with: *
- * D0.L = Unsigned number. *
- * *
- * Returns: *
- * D0.L = SQRT(D0.L) *
- * *
- * Uses: D1-D4 as temporaries -- *
- * D1 = Error term; *
- * D2 = Running estimate; *
- * D3 = High bracket; *
- * D4 = Loop counter. *
- * *
- * Notes: Result fits in D0.W, but is valid in longword. *
- * *
- * Takes from 1080 to 1236 cycles (including rts.) *
- * *
- * Two of the 16 passes are unrolled from the loop so that *
- * quicker instructions may be used where there is no *
- * danger of overflow (in the early passes). *
- * *
- * Instruction times for branch-type instructions *
- * listed as (X/Y) are for (taken/not taken). *
- * *
- *************************************************************************
-
- * Cycles
- glsqrt movem.l d1-d4,-(sp) (40)
- moveq #13,d4 (4) ; Loop count (bits-1 of result).
- moveq #0,d1 (4) ; Error term in D1.
- moveq #0,d2 (4)
- lsqrt1 add.l d0,d0 (8) ; Get 2 leading bits a time and add
- addx.w d1,d1 (4) ; into Error term for interpolation.
- add.l d0,d0 (8) ; (Classical method, easy in binary).
- addx.w d1,d1 (4)
- add.w d2,d2 (4) ; Running estimate * 2.
- move.w d2,d3 (4)
- add.w d3,d3 (4)
- cmp.w d3,d1 (4)
- bls.s lsqrt2 (10/8) ; New Error term > 2* Running estimate?
- addq.w #1,d2 (4) ; Yes, we want a '1' bit then.
- addq.w #1,d3 (4) ; Fix up new Error term.
- sub.w d3,d1 (4)
- lsqrt2 dbra d4,lsqrt1 (10/14) ; Do first 14 bit-pairs.
-
- add.l d0,d0 (8) ; Do 15-th bit-pair.
- addx.w d1,d1 (4)
- add.l d0,d0 (8)
- addx.l d1,d1 (8)
- add.w d2,d2 (4)
- move.l d2,d3 (4)
- add.w d3,d3 (4)
- cmp.l d3,d1 (6)
- bls.s lsqrt3 (10/8)
- addq.w #1,d2 (4)
- addq.w #1,d3 (4)
- sub.l d3,d1 (8)
-
- lsqrt3 add.l d0,d0 (8) ; Do 16-th bit-pair.
- addx.l d1,d1 (8)
- add.l d0,d0 (8)
- addx.l d1,d1 (8)
- add.w d2,d2 (4)
- move.l d2,d3 (4)
- add.l d3,d3 (8)
- cmp.l d3,d1 (6)
- bls.s lsqrt4 (10/8)
- addq.w #1,d2 (4)
- lsqrt4 move.w d2,d0 (4)
- movem.l (sp)+,d1-d4 (44)
- rts (16)
-
- end
- --
- +----------------+
- ! II CCCCCC ! Jim Cathey
- ! II SSSSCC ! ISC-Bunker Ramo
- ! II CC ! TAF-C8; Spokane, WA 99220
- ! IISSSS CC ! UUCP: uunet!isc-br!jimc (jimc@isc-br.isc-br.com)
- ! II CCCCCC ! (509) 927-5757
-
- +++++++++++++++++++++++++++
-
- >From lasley@umdsp.umd.edu (Scott E. Lasley)
- Date: Wed, 4 May 1994 20:25:02 -0400
- Organization: Space Physics Group, University of Maryland
-
- In article <rmah-030594174155@rmah.dialup.access.net>, rmah@panix.com (Robert
- S. Mah) writes:
- > There is an article on generating very fast square root aproximations
- > in the book "Graphics Gems", Ed. Glassner.
- >
- > It generates a lookup table which is indexed by munging the exponent
- > of the argument and then uses the mantissa to do an (linear?)
- > aproximation to the final result. It's fairly accurate to within a
- > few decimal points. I think the source code is also available
- > somewhere on the net, but I'm afraid I don't know where.
-
- some (perhaps all) of the code in the Graphics Gems series of books can be
- found at ftp.princeton.edu. here is a URL
-
- ftp://ftp.princeton.edu//pub/Graphics/GraphicsGems
-
- there are directories for books I through IV. the ReadMe file in the
- GemsIII directory contains the following line:
-
- sqrt.c II.1 Steve Hill, "IEEE Fast Square Root"
-
- hope this helps,
-
-
- lasley@umdsp.umd.edu
- "... and there were bumper stickers saying "Free Love"
- like it was some kind of political prisoner..." David Wilcox
-
-
- +++++++++++++++++++++++++++
-
- >From nagle@netcom.com (John Nagle)
- Date: Thu, 5 May 1994 02:23:34 GMT
- Organization: NETCOM On-line Communication Services (408 241-9760 guest)
-
- busfield@hurricane.seas.ucla.edu (John D. Busfield) writes:
- > Does anyone have an algorithm or code for computing square roots.
- >The emphasis is on speed rather than accuracy in that I only need the
- >result rounded to the nearest integer (ie sqrt(1000) = 31 or 32).
- >Thanks in advance.
-
- The classic for floating point is to check the exponent and if odd,
- shift the mantissa right one bit. Then halve the exponent. Use the
- high 4 bits of the mantissa to select a starting mantissa from a table of
- 16 values. Then execute Newton's Method for two or three iterations,
- written out in line. On a machine with an FPU, this beats most iterative
- methods. Try it on a PPC. This may actually be faster than an
- iterative integer method.
-
- John Nagle
-
- +++++++++++++++++++++++++++
-
- >From Ron_Hunsinger@bmug.org (Ron Hunsinger)
- Date: Thu, 5 May 94 05:35:48 PST
- Organization: Berkeley Macintosh Users Group
-
- parkb@bigbang.Stanford.EDU (Brian Park) writes:
-
- >lsqrt2() - traditional Newton's method
- >--------
- >The traditional Newton's method of getting y = sqrt(n), coded as lsqrt2()
- >below)
- >
- >n = y^2
- >y(0) = n
- >y(i+1) = (y(i) + n/y(i))/2
- >
- >is second-order convergent,
-
- You can do square root much faster than this. Square root is about as
- complex, and takes about as much time, as a divide. A square root is
- like a division in which the quotient and divisor are equal.
- The usual division algorithm knows the divisor from the beginning, and
- discovers the quotient a bit at a time. You just have to tweak the
- algorithm a little so that it discovers the divisor and quotient in
- parallel, a bit at a time.
-
- Think about how you would calculate square root by hand. Then try
- to do it in binary. Notice how easy it becomes.
-
- >My point is that Newton's method is O(log_2(n)) and you are not going
- >to do much better than that.
-
- Newton's method is O(log n) ITERATIONS, but you do a divide in each
- iteration. If division takes O(log n) shift-subtracts, then you have
- a run time of O((log n)^2) shift-subtracts, compared to only O(log n)
- shift-subtracts using a modified divide algorithm.
-
- [Not that it really matters. You aren't writing an arbitrary-precision
- square root routine. You have a fixed upper bound on the size of the
- input, and ANY operation with only a finite number of valid inputs
- can be implemented in O(1) time.]
-
- [And if you have a very fast divide, as fast as a shift-subtract, as
- might happen if you use a hardware divide instruction on a RISC computer,
- then both methods might have nearly the same speed. But if shifting
- is faster than division, Newton's method is going to lose.]
-
- -Ron Hunsinger
-
- +++++++++++++++++++++++++++
-
- >From afcjlloyd@aol.com (AFC JLloyd)
- Date: 5 May 1994 12:37:02 -0400
- Organization: America Online, Inc. (1-800-827-6364)
-
- In article <2q91dp$rf@nntp2.Stanford.EDU>, parkb@bigbang.Stanford.EDU (Brian
- Park) writes:
-
- >>>
- But let me empasize that even with straight Newton's method is O(log_2(n)),
- Sqrt(10^9) takes only 19 iterations, as you can verify. That's only 19
- integer divisions. How fast do you want your square root to be?
- <<<
-
- If you simply provide a better starting estimate you can reduce your 19
- iterations
- down to a maximum of 5. A very cheap way to provide a better estimate is to
- find the most significant bit and use it to estimate log2(N) and thus sqrt(N).
- I don't know about you, but when I want fast code I'll take a factor of four
- anyway I can get.
-
- However, if you're really out for speed, it turns out that Newton's
- method is not the best. Yesterday,
- pottier@corvette.ens.fr (Francois Pottier) posted an algorithm
- by (Dr.) John Bruner, bruner@csrd.uiuc.edu:
-
- >>>>
- unsigned int isqrt(unsigned long x)
- {
- unsigned long r, nr, m;
-
- r = 0;
- m = 0x40000000;
- do {
- nr = r + m;
- if (nr <= x) {
- x -= nr;
- r = nr + m;
- }
- r >>= 1;
- m >>= 2;
- } while (m != 0);
-
- return(r);
- }
- <<<<
-
- (I've removed two lines of code that round the result up if
- rounding to nearest integer is desired).
- This is an incredibly great algorithm, which uses exactly
- 16 iterations but no integer multiply or divides.
-
- However, if you really want the fastest possible code,
- you should rewrite it like this:
-
- unsigned short isqrt(register unsigned long x)
- {
- register unsigned long r;
- register unsigned long nr;
- register unsigned long m;
-
- r = 0;
- m = 0x40000000;
- nr = m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
- nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
- nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
- nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
- nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
- nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
- nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
- nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
- nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
- nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
- nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
- nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
- nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
- nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
- nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
- nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1;
-
- return(r);
- }
-
- This code is 40% faster (on my 840av) than code that I have been
- using that used 4 iterations of Newton-Raphson after seeding
- along the lines I mentioned above.
-
- It may be possible to speed this up even further by using the
- fast estimate of Log2(N) to reduce the number of iterations
- from 16 to the absolute minimum (ceil(Log2(N)/2), though
- I expect the improvement to be minimal since each iteration of
- this algorithm is so cheap.
-
- Jim Lloyd
- afcjlloyd@aol.com
-
-
- +++++++++++++++++++++++++++
-
- >From trussler@panix.com (Tom Russler)
- Date: Fri, 06 May 1994 00:38:31 -0400
- Organization: PANIX Public Access Internet and UNIX (NYC)
-
- busfield@hurricane.seas.ucla.edu (John D. Busfield) writes:
- > Does anyone have an algorithm or code for computing square roots.
- >The emphasis is on speed rather than accuracy in that I only need the
- >result rounded to the nearest integer (ie sqrt(1000) = 31 or 32).
- >Thanks in advance.
-
- In 68000 assembly in THINK C:
-
- move.l number, d0
- move.l #1, d1
- move.l d0, d2
- @1 cmp.l d1, d2
- ble @2
- lsl.l #1, d1
- lsr.l #1, d2
- bra @1
- @2 add.l d2, d1
- lsr.l #1, d1
- divu d1, d0
- add.w d1, d0
- add.w #1, d0
- lsr.w #1, d0
-
- --
- Tom Russler
- trussler@panix.com
-
- +++++++++++++++++++++++++++
-
- >From zstern@adobe.com (Zalman Stern)
- Date: Fri, 6 May 1994 04:35:37 GMT
- Organization: Adobe Systems Incorporated
-
- Ron_Hunsinger@bmug.org (Ron Hunsinger) writes
- > [And if you have a very fast divide, as fast as a shift-subtract, as
- > might happen if you use a hardware divide instruction on a RISC computer,
-
- This is not even close to true for any RISC implementation I know of. If the
- divisor is loop-invariant, one can multiply by the reciprocal which may be a
- single cycle operation, however this is not the case in the example being
- discussed. Most RISC architecture now include a floating-point square root
- instruction. This is partially due to this operations similarity to division
- (i.e. can use the same hardware) and partially due to its common appearance
- in electrical CAD codes :-)
-
- I believe the integer square root routine I use is based on the algorithm
- Ron referred to. It has the following comment at the top (thanks to Mark
- Hamburg):
-
- /*
- * A square-root routine. See Dijkstra, "A Discipline of Programming",
- * page 65.
- */
-
- I threw in an optimization to use the PowerPC count-leading-zeros
- instruction to efficiently reduce the number of iterations. (There is a
- similar instruction on the 68K.) The result is competitive with a lookup
- table based method for computing sqrt(x^2 + y^2) that uses a divide. (I'd
- post the code, but I didn't write it and I'm sure Adobe would frown on my
- doing so. Besides, reading Dijkstra is good for you :-))
- --
- Zalman Stern zalman@adobe.com (415) 962 3824
- Adobe Systems, 1585 Charleston Rd., POB 7900, Mountain View, CA 94039-7900
- There is no lust like the present.
-
- +++++++++++++++++++++++++++
-
- >From hall_j@sat.mot.com (Joseph Hall)
- Date: Thu, 5 May 1994 21:55:49 GMT
- Organization: Motorola Inc., Satellite Communications
-
- Seems it was nagle@netcom.com (John Nagle) who said:
- >busfield@hurricane.seas.ucla.edu (John D. Busfield) writes:
- >> Does anyone have an algorithm or code for computing square roots.
- >>The emphasis is on speed rather than accuracy in that I only need the
- >>result rounded to the nearest integer (ie sqrt(1000) = 31 or 32).
- >>Thanks in advance.
- >
- > The classic for floating point is to check the exponent and if odd,
- >shift the mantissa right one bit. Then halve the exponent. [...]
-
- You can consult Plaugher's "The Standard C Library" for a portable
- floating point sqrt. For a positive real number, the algorithm is:
-
- 1) separate exponent and mantissa -- basically bit-masking except
- in cases of gradual underflow
- 2) compute a quadratic least-squares fit to sqrt for the mantissa
- (y = (-.1984742*x+.8804894)*x + .3176687)
- 3) follow with 3 iterations of Newton's method (y = .5*(y+x/y))
- 4) if exponent is odd, a) multiply mantissa by sqrt(2) and
- b) subtract 1 from exponent
- 5) divide exponent by 2
- 6) reassemble exponent and mantissa
-
- Right-shifting the mantissa for odd exponents before computing
- the root, to eliminate the multiplication by sqrt(2), is harder to
- express portably, but is certainly possible. Plaugher doesn't make
- any particular efforts to replace things like ".5 * x" with shifts.
-
- Handy little book. ISBN 0-13-131509-9.
-
- (begin digression)
-
- Plaugher uses conventional floating-point techniques to compute
- transcendental functions. This is pretty efficient, and these days
- may be as good as any technique. However, there are also algorithms
- that allow you to compute trigonometric functions, as well as exponent
- and log, by using only shifts and adds/subtracts--as if you were doing a
- divide. In fact, the computational effort is just twice that of
- a divide. I have seen relatively few references to these techniques,
- but they are easy to implement.
-
- For example, to compute an exponent, you need a table of ln(101(2)),
- ln(11(2)), ln(1(2)), ln(1.1(2)), ln(1.01(2)), ln(1.001(2)), etc.
- Then you can code a fixed-point version like this (please excuse typos
- since I can't paste this directly in here at the moment):
-
- unsigned long logTbl[] =
- { /* binary ln((2^i + 65536)/65536), for i == 32..1), e.g. */
- 0xa65b1, 0x9b441, 0x902d3 ...
- /* ln 2147549184/65536, ln 1073807360/65536, ln 536936448/65536 */ };
-
- unsigned long fexp(unsigned long x)
- {
- long i = 1L, j = 0L, k = 0x10000L, l = 0L, *lt = logTbl - 1;
- while (i <= 32) {
- j = l + lt[i];
- if (j > x) {
- i++;
- } else {
- l = j;
- k += k >> i;
- }
- }
- return k;
- }
-
- This computes the exponent iteratively in k. The log of the approximation
- is maintained in l. If l + the current entry in the table is <=
- than the value you are computing the exponent of, then you add the
- log in the table to l, and multiply k by a number that is conveniently
- in the form 1 + some power of 2. Repeat until the table is exhausted.
-
- These days this may not be any faster than using the coprocessor. However,
- this technique was occasionally used to write very high-performance libraries
- for 8- and early 16-bit processors. I doubt it was ever used in any
- commercial BASIC, though, because it is not well known, and BASIC would
- have been a hell of a lot faster for numerical work if it had been. I first
- heard this technique described in the mid 70s, but the literature goes
- back to at least the 60s.
-
- I wonder if this technique has ever been used to implement transcendental
- functions in microcode. At one time I was thinking of trying to get it
- working on a TI bit slice CPU, that is, build my own coprocessor. :-)
-
- --
- Joseph Nathan Hall | Joseph's Law of Interface Design: Never give your users
- Software Architect | a choice between the easy way and the right way.
- Gorca Systems Inc. | joseph@joebloe.maple-shade.nj.us (home)
- (on assignment) | (602) 732-2549 (work) Joseph_Hall-SC052C@email.mot.com
-
- +++++++++++++++++++++++++++
-
- >From Ron_Hunsinger@bmug.org (Ron Hunsinger)
- Date: Sat, 7 May 94 13:41:39 PST
- Organization: Berkeley Macintosh Users Group
-
- parkb@bigbang.Stanford.EDU (Brian Park) writes:
-
- >But let me empasize that even with straight Newton's method is O(log_2(n)),
- >Sqrt(10^9) takes only 19 iterations, as you can verify. That's only 19
- >integer divisions. How fast do you want your square root to be?
-
- I want it to be as fast as a single integer division. Why take 19 times
- as long as I have to?
-
-
-
- zstern@adobe.com (Zalman Stern) writes:
-
- >Ron_Hunsinger@bmug.org (Ron Hunsinger) writes
- >> [And if you have a very fast divide, as fast as a shift-subtract, as
- >> might happen if you use a hardware divide instruction on a RISC computer,
- >
- >This is not even close to true for any RISC implementation I know of.
-
- Glad to hear it. I only included the comment because I did not have
- instruction timings handy, and I was afraid someone was going to come
- along and say I needn't worry about avoiding divides because on the
- PowerPC (or some other RISC computer) EVERYTHING, including divide,
- took only a single cycle, and so divide was just as fast as shift.
-
- I knew shift couldn't be slower than divide, and if it really is faster
- (which is reasonable), then finding a square root algorithm that does
- not do any divides is a worthwhile objective.
-
- >I believe the integer square root routine I use is based on the algorithm
- >Ron referred to. It has the following comment at the top (thanks to Mark
- >Hamburg):
- >
- >/*
- > * A square-root routine. See Dijkstra, "A Discipline of Programming",
- > * page 65.
- > */
-
- A distinct possibility, since I read all the Dijkstra books I could get
- my hands on, way back when. However, I knew that square root could be
- done as quickly as division at least as far back as high school, before
- I had ever heard of Dijkstra (or seen a computer for that matter). As
- I said, the method suggests itself as soon as you assail to calculate
- square root using the traditional pencil and paper method in binary.
-
- >I threw in an optimization to use the PowerPC count-leading-zeros
- >instruction to efficiently reduce the number of iterations. (There is a
- >similar instruction on the 68K.)
-
- There is? Is this a 68040 instruction? I can't find any such instruction
- at least through the 68030. I thought the 68040 was just a 68030 with
- a bigger cache and a built-in FPU. Sure would be nice, though. The
- Unisys A-series computers have a similar instruction, accessible through
- the built-in FIRSTONE function in Burroughs Extended Algol, which I
- found many interesting (and not always obvious) uses for.
-
- >The result is competitive with a lookup
- >table based method for computing sqrt(x^2 + y^2) that uses a divide. (I'd
- >post the code, but I didn't write it and I'm sure Adobe would frown on my
- >doing so. Besides, reading Dijkstra is good for you :-))
-
- No problem. I'll post mine, which I wrote from scratch before checking
- your Dijkstra reference. My routine and his, neither of which uses divides
- (except by 2 or 4), bear only a vague similarity. (But reading Dijkstra
- is still good for you.) The whole thing is as fast as a single divide
- done by subroutine (as, for example, would be required to divide 32-bit
- integers on a 68000).
-
- Normally, a division subroutine will keep the divisor fixed and shift
- the dividend past it, doing subtractions (and incrementing the quotient)
- as needed. My first thought was to do the same thing, but that's messy
- to do in C. (In assembler, you would use LSL, ROXL to shift bits out
- of one register into another; but that doesn't map well to C.) So
- instead, I keep the dividend fixed, and shift the divisor right. Just as
- in the pencil and paper square root method, you take the digits (bits) of
- the argument two at a time. But since the quotient (a.k.a. the square
- root) is accumulating bits one at a time in the opposite direction, it
- ends up only shifting one bit at a time.
-
- Without further comment, my version (which is meant to be fast, not
- necessarily readable) follows:
-
- #define BITS 32
- // BITS is the smallest even integer such that 2^BITS > ULONG_MAX.
- // If you port this routine to a computer where 2^(BITS-1) > ULONG_MAX
- // (Unisys A-series, for example, where integers would generally be
- // stored in the 39-bit mantissa of a 48-bit word), unwind the first
- // iteration of the loop, and special-case references to 'half' in that
- // first iteration so as to avoid overflow.
-
- unsigned long lsqrt (unsigned long n) {
- // Compute the integer square root of the integer argument n
- // Method is to divide n by x computing the quotient x and remainder r
- // Notice that the divisor x is changing as the quotient x changes
-
- // Instead of shifting the dividend/remainder left, we shift the
- // quotient/divisor right. The binary point starts at the extreme
- // left, and shifts two bits at a time to the extreme right.
-
- // The residue contains n-x^2. (Within these comments, the ^ operator
- // signifies exponentiation rather than exclusive or. Also, the /
- // operator returns fractions, rather than truncating, so 1/4 means
- // one fourth, not zero.)
-
- // Since (x + 1/2)^2 == x^2 + x + 1/4,
- // n - (x + 1/2)^2 == (n - x^2) - (x + 1/4)
- // Thus, we can increase x by 1/2 if we decrease (n-x^2) by (x+1/4)
-
- unsigned long residue; // n - x^2
- unsigned long root; // x + 1/4
- unsigned long half; // 1/2
-
- residue = n; // n - (x = 0)^2, with suitable alignment
- root = 1 << (BITS - 2); // x + 1/4, shifted left BITS places
- half = root + root; // 1/2, shifted left BITS places
-
- do {
- if (root <= residue) { // Whenever we can,
- residue -= root; // decrease (n-x^2) by (x+1/4)
- root += half; } // increase x by 1/2
- half >>= 2; // Shift binary point 2 places right
- root -= half; // x{+1/2}+1/4 - 1/8 == x{+1/2}+1/8
- root >>= 1; // 2x{+1}+1/4, shifted right 2 places
- } while (half); // When 1/2 == 0, binary point is at far right
-
- // Omit the following line to truncate instead of rounding
- if (root < residue) ++root;
-
- return root; // Guaranteed to be correctly rounded (or truncated)
- }
-
- Enjoy,
- -Ron Hunsinger
-
- +++++++++++++++++++++++++++
-
- >From Ron_Hunsinger@bmug.org (Ron Hunsinger)
- Date: Sun, 8 May 94 04:45:40 PST
- Organization: Berkeley Macintosh Users Group
-
- trussler@panix.com (Tom Russler) writes:
- >
- > busfield@hurricane.seas.ucla.edu (John D. Busfield) writes:
- > > Does anyone have an algorithm or code for computing square roots.
- > >The emphasis is on speed rather than accuracy in that I only need the
- > >result rounded to the nearest integer (ie sqrt(1000) = 31 or 32).
- > >Thanks in advance.
- >
- >In 68000 assembly in THINK C:
- >
- > move.l number, d0
- > move.l #1, d1
- > move.l d0, d2
- > @1 cmp.l d1, d2
- > ble @2
- > lsl.l #1, d1
- > lsr.l #1, d2
- > bra @1
- > @2 add.l d2, d1
- > lsr.l #1, d1
- > divu d1, d0
- > add.w d1, d0
- > add.w #1, d0
- > lsr.w #1, d0
-
- Sorry. Valiant effort, but no cigar. You don't handle boundary
- conditions correctly. Examine what happens when 'number' is 0xFFFFaaaa,
- where aaaa is any arbitrary bit pattern.
-
- > move.l number, d0 ; d0 = 0xFFFFaaaa
- > move.l #1, d1 ; d1 = 0x00000001
- > move.l d0, d2 ; d2 = 0xFFFFaaaa
-
- > @1 cmp.l d1, d2
- > ble @2
- > lsl.l #1, d1
- > lsr.l #1, d2
- > bra @1
-
- ; after 16 iterations, you break out of the above loop, with:
- ; d1 = 0x00010000 = 65536
- ; d2 = 0x0000FFFF = 65535
-
- > @2 add.l d2, d1 ; d1 = 0x0001FFFF
- > lsr.l #1, d1 ; d1 = 0x0000FFFF = 65535
- > divu d1, d0 <-----OVERFLOW: d0/d1 >= 65536, to big for a short
-
- In fact, you should have seen that overflow was a problem. DIVU will
- always overflow, no matter what the divisor, if the dividend is greater
- than 65535*65535 = 0xFFFE0001.
-
- You may have been counting on the fact that, if overflow is detected,
- the operands are unaffected, but that doesn't help. The contents of
- d0.w after the divide is aaaa, which was a completely arbitrary value.
- Continuing:
-
- > add.w d1, d0 <-----OVERFLOW AGAIN: Since d1=65535, the
- > add.w #1, d0 ; effect of these two instruction is to
- ; overflow at some point, but otherwise
- ; do nothing. d0 still contains aaaa
-
- > lsr.w #1, d0 ; divide the garbage by 2.
-
- The final result is in the range 0..32767. The correct result should
- be 65535 (if aaaa is 0000) or 65536 (for any other value of aaaa).
- Again, you should have seen this coming. Your final instruction can
- never return a value >32767, meaning that you cannot possibly return
- the correct result for any number > 32767*32768 = 0x3FFF8000.
-
-
- You also do not achieve the stated accuracy goal, "I [only] need the
- result rounded to the nearest integer", even when you don't overflow.
-
- Suppose 'number' = 0x10004000 = 268451840. Your code computes a value
- of 16794. The correctly rounded value is 16384. Off by 410 is not "the
- nearest integer".
-
- -Ron "who gets very picky sometimes :)" Hunsinger
-
- +++++++++++++++++++++++++++
-
- >From fixer@faxcsl.dcrt.nih.gov (Chris Gonna' Find Ray Charles Tate)
- Date: Mon, 9 May 1994 14:04:10 GMT
- Organization: DCRT, NIH, Bethesda, MD
-
- In article <0014A722.fc@bmug.org>, Ron_Hunsinger@bmug.org (Ron Hunsinger) writes:
- >
- >zstern@adobe.com (Zalman Stern) writes:
- >>
- >>I threw in an optimization to use the PowerPC count-leading-zeros
- >>instruction to efficiently reduce the number of iterations. (There is a
- >>similar instruction on the 68K.)
- >
- >There is? Is this a 68040 instruction? I can't find any such instruction
- >at least through the 68030. I thought the 68040 was just a 68030 with
- >a bigger cache and a built-in FPU.
-
- I think you're looking for the BFFFO instruction, which my 680x0 family
- reference describes as "Find First One in Bit Field," and says it's
- a 60820 instruction (and later, of course).
-
- The book I'm looking in is Motorola's "M68000 Family Programmer's
- Reference Manual," which doesn't seem to have any Library of Congress
- registration (!), but which is Motorola part number M68000PM/AD.
-
- - -------------------------------------------------------------
- Christopher Tate | "So he dropped the heart -
- MSD, Inc. | the floor's clean."
- fixer@faxcsl.dcrt.nih.gov | - Sidney Harris
-
- +++++++++++++++++++++++++++
-
- >From christer@cs.umu.se (Christer Ericson)
- Date: Tue, 10 May 1994 10:24:38 GMT
- Organization: Dept. of Computing Science, Umea Univ., 901 87 Umea, Sweden
-
- In <00148DC5.fc@bmug.org> Ron_Hunsinger@bmug.org (Ron Hunsinger) writes:
- >
- >parkb@bigbang.Stanford.EDU (Brian Park) writes:
- >>[...suggests using Newton's method...]
- >
- >You can do square root much faster than this.
-
- I beg to differ. In article <CpA23t.DwM@cs.umu.se> I posted an 68000-
- based implementation of Newton's method (with quirks) which is something
- like 20-30% faster than the equivalent hand-coded optimized assembly
- version of the routine you posted in another article. It all depends on
- how good your initial guess for Newton's method is.
-
- So there! :-)
-
-
- Christer Ericson --- Internet: christer@cs.umu.se --- tel: +46-90-166794
- Department of Computer Science, University of Umea, S-90187 UMEA, SWEDEN
-
- +++++++++++++++++++++++++++
-
- >From afcjlloyd@aol.com (AFC JLloyd)
- Date: 10 May 1994 18:01:03 -0400
- Organization: America Online, Inc. (1-800-827-6364)
-
- In article <CpL0x4.HKJ@cs.umu.se>, christer@cs.umu.se (Christer Ericson)
- writes:
-
- >In <00148DC5.fc@bmug.org> Ron_Hunsinger@bmug.org (Ron Hunsinger) writes:
- >>
- >>parkb@bigbang.Stanford.EDU (Brian Park) writes:
- >>>[...suggests using Newton's method...]
- >>
- >>You can do square root much faster than this.
- >
- >I beg to differ. In article <CpA23t.DwM@cs.umu.se> I posted an 68000-
- >based implementation of Newton's method (with quirks) which is something
- >like 20-30% faster than the equivalent hand-coded optimized assembly
- >version of the routine you posted in another article. It all depends on
- >how good your initial guess for Newton's method is.
-
- I agree with Christer. Until this thread I had been using Newton's method
- with 5 iterations for obtaining the square root of unsigned longs, where
- the initial estimate came from finding the most signficant bit, using
- that to estimate Log2(n), and using that to estimate sqrt(n).
-
- When I saw this thread I tried the code attributed to Dr. John Bruner
- (similar to Ron's code) and saw a 40% speedup over my Newton's method
- code. Convinced that this was a precious find, I then tried to apply
- the same technique to taking the square root of Fixed point values and
- discovered that it couldn't beat my Newton's method code. The reason
- was primarily that 24 (instead of 16) iterations are required for
- fixed point square root (the Fract case would require 31 iterations).
-
- This caused me to do more thinking about Newton's method.
- The great thing about Newton's method is that it _doubles_ the
- precision with each iteration. Of course, double of nothing
- is nothing, so Newton's method is worthless if you give it a poor
- estimate, but given a good estimate Newton's method is fantastic.
-
- So, I decided to use a table lookup to provide the initial estimate for
- Newton's method. I found that I could reduce the number of iterations
- down from five to _one_ for the integer case, and just _two_ iterations
- for both the Fixed and Fract cases. The resulting code is significantly
- faster than any of my previous attempts.
-
- Jim Lloyd
- afcjlloyd@aol.com
-
-
- +++++++++++++++++++++++++++
-
- >From Ron_Hunsinger@bmug.org (Ron Hunsinger)
- Date: Sat, 14 May 94 03:01:56 PST
- Organization: Berkeley Macintosh Users Group
-
- christer@cs.umu.se (Christer Ericson) writes:
-
- >In <00148DC5.fc@bmug.org> Ron_Hunsinger@bmug.org (Ron Hunsinger) writes:
- >>
- >>parkb@bigbang.Stanford.EDU (Brian Park) writes:
- >>>[...suggests using Newton's method...]
- >>
- >>You can do square root much faster than this.
- >
- >I beg to differ. In article <CpA23t.DwM@cs.umu.se> I posted an 68000-
- >based implementation of Newton's method (with quirks) which is something
- >like 20-30% faster than the equivalent hand-coded optimized assembly
- >version of the routine you posted in another article. It all depends on
- >how good your initial guess for Newton's method is.
-
- You've got me mixed up with somebody else. I did say you can do square
- root much faster, and explained how, but I didn't post any assembly
- version. I posted the C version (which therefore has the advantage
- that it can be compiled native for PowerPC, or ported easily to another
- machine that doesn't even have to be binary, although of course it is
- unlikely that shifting will be fast on a non-binary machine).
-
- I saw two assembly versions. One, which was quite long, very tightly
- optimized, and attributed to Motorola, was indeed faster than my short
- simple C version, but not by all that much. Also, it always truncated,
- where mine would produce the correctly rounded result (or the truncated
- result by commenting out one line if that's really what you wanted).
-
- The other, very short assembler version was hopelessly flawed. It
- computed an initial guess that could be off by a factor of two, and
- then did *ONE* iteration of Newton's method, and stopped there, producing
- a value that was not even close (in absolute terms), even when it did
- not lose all significance due to overflow. If asked to take the square
- root of zero, it would divide by zero. It probably was fast (when it
- didn't crash), but who cares how quickly you get the wrong answer?
-
- Which version was yours?
-
- I'll be charitable and assume it was the Motorola version. I no longer
- have the original message, but I seem to recall the max running time
- was something like 1262 cycles on a 68000, and (quite a bit) faster
- on small values. I modified my C version to special-case small values
- too, and unwind the iterations of the main loop up to the iteration that
- finds the high bit of the root. The main benefit of special-casing small
- values is so I don't have to worry about that boundary in the remaining
- code -- I am then guaranteed that the (unrounded) root is at least two
- bits long. And although I am not displeased by the improved efficiency,
- my main reason for unwinding the first iterations was to remove the
- restriction in the prior version that it would only work on machines in
- which the number of bits in an unsigned long was even. This code will
- now work even on machines with odd integer sizes. (Yes, Virginia, such
- machines do exist! An example is the Unisys A-series computers, in which
- ALL arithmetic is done in floating point, and integers are just the special
- case where the exponent is zero and the integer value is stored in the
- 39-bit mantissa of a 48-bit word.)
-
- My improved C version appears at the end of this message.
-
- I compiled it with MPW C with no special optimizations, tested it
- thoroughly, and then disassembled the code and calculated the running
- time on a 68000. (On later machines, the presence of the cache makes
- accurate timings much more difficult to do by hand.) I did not make
- any attempt to "improve" the C-generated code, although there are some
- easy and obvious optimizations available. The running time for my
- version (the one that correctly rounds), counting the RTS and the
- (quite unnecessary) LINK/MOVEM.L/.../MOVE.L <rslt>,D0/MOVEM.L/UNLK is:
-
- If N <= 12 then 204+4N, else 620+46u+32z+6r, where
- N = input argument
- u = Number of one-bits in the unrounded root
- z = Number of non-leading zero-bits in the unrounded root
- r = 1 if rounding increases the root, else 0
-
- The actual running time for various values of N is:
-
- N Time Comments
- --------- ----- -------------------------
- 0 204 Best time
- 13 718 Best time not using lookup table
- 625 822 This is a cut-point for the Motorola code
- 0..65535 935 Average* time over inputs that fit in 16 bits
- 65535 994 Worst time if N fits in 16 bits
- 0..0xFFFFFFFF 1247 Average* time over all 32-bit inputs
- 0xFFFFFFFF 1362 Worst case
- *average times computed assuming uniform distrubution of outputs
-
- If my memory serves me right, this is only about 8% slower than the
- hand-crafted Motorola code, which I think is a cheap price to pay for
- the portability. Especially considering that most of that difference
- is due to the fact that they split the general loop into a main loop
- that could get away with word-size operations for some variables, and
- a few unwound iterations that had to use long-sized operands. Notice
- that this is counterproductive on 68020 and up, where long operations
- are just as fast as word operations when the operands are in registers,
- and the only effect of unwinding the last two iterations is to
- guarantee that they won't be in the cache.
-
- The compiled 68K code fits in 104 bytes (decimal). (96 bytes without
- the debugger symbol. This really was a vanilla compile.) How big was
- your code? How big is the code cache on your machine? How much of
- your caller's code do you push out of the cache?
-
-
- >So there! :-)
-
- Same to you!! :-)
-
-
- #define lsqrt_max4pow (1UL << 30)
- // lsqrt_max4pow is the (machine-specific) largest power of 4 that can
- // be represented in an unsigned long.
-
- unsigned long lsqrt (unsigned long n) {
- // Compute the integer square root of the integer argument n
- // Method is to divide n by x computing the quotient x and remainder r
- // Notice that the divisor x is changing as the quotient x changes
-
- // Instead of shifting the dividend/remainder left, we shift the
- // quotient/divisor right. The binary point starts at the extreme
- // left, and shifts two bits at a time to the extreme right.
-
- // The residue contains n-x^2. (Within these comments, the ^ operator
- // signifies exponentiation rather than exclusive or. Also, the /
- // operator returns fractions, rather than truncating, so 1/4 means
- // one fourth, not zero.)
-
- // Since (x + 1/2)^2 == x^2 + x + 1/4,
- // n - (x + 1/2)^2 == (n - x^2) - (x + 1/4)
- // Thus, we can increase x by 1/2 if we decrease (n-x^2) by (x+1/4)
-
- unsigned long residue; // n - x^2
- unsigned long root; // x + 1/4
- unsigned long half; // 1/2
-
- residue = n; // n - (x = 0)^2, with suitable alignment
-
- // if the correct answer fits in two bits, pull it out of a magic hat
- #ifndef lsqrt_truncate
- if (residue <= 12)
- return (0x03FFEA94 >> (residue *= 2)) & 3;
- #else
- if (residue <= 15)
- return (0xFFFEAA54 >> (residue *= 2)) & 3;
- #endif
- root = lsqrt_max4pow; // x + 1/4, shifted all the way left
- // half = root + root; // 1/2, shifted likewise
-
- // Unwind iterations corresponding to leading zero bits
- while (root > residue) root >>= 2;
-
- // Unwind the iteration corresponding to the first one bit
- // Operations have been rearranged and combined for efficiency
- // Initialization of half is folded into this iteration
- residue -= root; // Decrease (n-x^2) by (0+1/4)
- half = root >> 2; // 1/4, with binary point shifted right 2
- root += half; // x=1. (root is now (x=1)+1/4.)
- half += half; // 1/2, properly aligned
-
- // Normal loop (there is at least one iteration remaining)
- do {
- if (root <= residue) { // Whenever we can,
- residue -= root; // decrease (n-x^2) by (x+1/4)
- root += half; } // increase x by 1/2
- half >>= 2; // Shift binary point 2 places right
- root -= half; // x{+1/2}+1/4 - 1/8 == x{+1/2}+1/8
- root >>= 1; // 2x{+1}+1/4, shifted right 2 places
- } while (half); // When 1/2 == 0, bin. point is at far right
-
- #ifndef lsqrt_truncate
- if (root < residue) ++root; // round up if (x+1/2)^2 < n
- #endif
-
- return root; // Guaranteed to be correctly rounded (or truncated)
- }
-
- -Ron Hunsinger
-
- +++++++++++++++++++++++++++
-
- >From christer@cs.umu.se (Christer Ericson)
- Date: Mon, 16 May 1994 08:44:47 GMT
- Organization: Dept. of Computing Science, Umea Univ., 901 87 Umea, Sweden
-
- In <0014E819.fc@bmug.org> Ron_Hunsinger@bmug.org (Ron Hunsinger) writes:
- >
- >christer@cs.umu.se (Christer Ericson) writes:
- >>[...]
- >>I beg to differ. In article <CpA23t.DwM@cs.umu.se> I posted an 68000-
- >>based implementation of Newton's method (with quirks) which is something
- >>like 20-30% faster than the equivalent hand-coded optimized assembly
- >>version of the routine you posted in another article. It all depends on
- >>how good your initial guess for Newton's method is.
- >
- >You've got me mixed up with somebody else. I did say you can do square
- >root much faster, and explained how, but I didn't post any assembly
- >version. I posted the C version (which therefore has the advantage
- >that it can be compiled native for PowerPC, or ported easily to another
- >machine that doesn't even have to be binary, although of course it is
- >unlikely that shifting will be fast on a non-binary machine).
-
- Gosh! What if you read what I wrote above. I didn't claim your code was
- written in assembly. I refer rather clearly to highly optimized assembly
- code _equivalent_ to the C code you posted.
-
-
- >I saw two assembly versions. One, which was quite long, very tightly
- >optimized, and attributed to Motorola, was indeed faster than my short
- >simple C version, but not by all that much. Also, it always truncated,
- >where mine would produce the correctly rounded result (or the truncated
- >result by commenting out one line if that's really what you wanted).
- >
- >The other, very short assembler version was hopelessly flawed. It
- >computed an initial guess that could be off by a factor of two, and
- >then did *ONE* iteration of Newton's method, and stopped there, producing
- >a value that was not even close (in absolute terms), even when it did
- >not lose all significance due to overflow. If asked to take the square
- >root of zero, it would divide by zero. It probably was fast (when it
- >didn't crash), but who cares how quickly you get the wrong answer?
- >
- >Which version was yours?
-
- Neither of these. Instead of arguing about some program you thought I
- wrote, wouldn't it have been better (as well as making you look better)
- if you had looked my article up, or in the case it had expired at your
- site, requested a copy from me before making yourself silly in public?
-
- Instead of commenting on your rather irrelevant comments on some code
- that wasn't mine, I've appended below a copy of my code and if you
- care to compare it to your code, I'll take the time to respond to
- any comments you might have (size, speed, what have you). Fair?
-
- Finally, the point I wanted to make was that an optimized version of
- Newton's method very well beats an optimized version of the type of
- algorithm your program is a version of. I'm not interested in any
- "...b-b-b-but C is more portable..." arguments.
-
-
- - - cut here ---
- unsigned short ce_quick_sqrt(n)
- register unsigned long n;
- {
- asm {
- ;-------------------------
- ;
- ; Routine : ISQRT; integer square root
- ; I/O parameters: d0.w = sqrt(d0.l)
- ; Registers used: d0-d2/a0
- ;
- ; This is a highly optimized integer square root routine, based
- ; on the iterative Newton/Babylonian method
- ;
- ; r(n + 1) = (r(n) + A / R(n)) / 2
- ;
- ; Verified over the full interval [0x0L,0xFFFFFFFFL]
- ;
- ; Some minor compromises have been made to make it perform well
- ; on the 68000 as well as 68020/030/040. This routine outperforms
- ; the best of all other algorithms I've seen (except for a table
- ; lookup).
- ;
- ; Copyright (c) Christer Ericson, 1993. All rights reserved.
- ;
- ; Christer Ericson, Dept. of Computer Science, University of Umea,
- ; S-90187 UMEA, SWEDEN. Internet: christer@cs.umu.se
- ;
-
- ;-------------------------
- ;
- ; Macintosh preamble since THINK C passes first register param
- ; in d7. Remove this section on any other machine
- ;
- move.l d7,d0
- ;-------------------------
- ;
- ; Actual integer square root routine starts here
- ;
- move.l d0,a0 ; save original input value for the iteration
- beq.s @exit ; unfortunately we must special case zero
- moveq #2,d1 ; init square root guess
- ;-------------------------
- ;
- ; Speed up the process of finding a power of two approximation
- ; to the actual square root by shifting an appropriate amount
- ; when the input value is large enough
- ;
- ; If input values are word only, this section can be removed
- ;
- move.l d0,d2
- swap d2 ; do [and.l 0xFFFF0000,d2] this way to...
- tst.w d2 ; go faster on 68000 and to avoid having to...
- beq.s @skip8 ; reload d2 for the next test below
- move.w #0x200,d1 ; faster than lsl.w #8,d1 (68000)
- lsr.l #8,d0
- @skip8 and.w #0xFE00,d2 ; this value and shift by 5 are magic
- beq.s @skip4
- lsl.w #5,d1
- lsr.l #5,d0
- @skip4
-
- ;-------------------------
- ;
- ; This finds the power of two approximation to the actual square root
- ; by doing the integer equivalent to sqrt(x) = 2 ^ (log2(x) / 2). This
- ; is done by shifting the input value down while shifting the root
- ; approximation value up until they meet in the middle. Better precision
- ; (in the step described below) is gained by starting the approximation
- ; at 2 instead of 1 (this means that the approximation will be a power
- ; of two too large so remember to shift it down).
- ;
- ; Finally (and here's the clever thing) since, by previously shifting the
- ; input value down, it has actually been divided by the root approximation
- ; value already so the first iteration is "for free". Not bad, eh?
- ;
- @loop add.l d1,d1
- lsr.l #1,d0
- cmp.l d0,d1
- bcs.s @loop
- @skip lsr.l #1,d1 ; adjust the approximation
- add.l d0,d1 ; here we just add and shift to...
- lsr.l #1,d1 ; get the first iteration "for free"!
- ;-------------------------
- ;
- ; The iterative root value is guaranteed to be larger than (or equal to)
- ; the actual square root, except for the initial guess. But since the first
- ; iteration was done above, the loop test can be simplified below
- ; (Oh, without the bvs.s the routine will fail on most large numbers, like
- ; for instance, 0xFFFF0000. Could there be a better way of handling these,
- ; so the bvs.s can be removed? Nah...)
- ;
- @loop2 move.l a0,d2 ; get original input value
- move.w d1,d0 ; save current guess
- divu.w d1,d2 ; do the Newton method thing
- bvs.s @exit ; if div overflows, exit with current guess
- add.w d2,d1
- roxr.w #1,d1 ; roxr ensures shifting back carry overflow
- cmp.w d0,d1
- bcs.s @loop2 ; exit with result in d0.w
- @exit
- }
- }
-
- Christer Ericson --- Internet: christer@cs.umu.se --- tel: +46-90-166794
- Department of Computer Science, University of Umea, S-90187 UMEA, SWEDEN
-
- +++++++++++++++++++++++++++
-
- >From Ron_Hunsinger@bmug.org (Ron Hunsinger)
- Date: Tue, 24 May 94 23:23:06 PST
- Organization: Berkeley Macintosh Users Group
-
- christer@cs.umu.se (Christer Ericson) writes:
-
- >In <0014E819.fc@bmug.org> Ron_Hunsinger@bmug.org (Ron Hunsinger) writes:
- >>
- >>christer@cs.umu.se (Christer Ericson) writes:
- >>>[...]
- >>>I beg to differ. In article <CpA23t.DwM@cs.umu.se> I posted an 68000-
- >>>based implementation of Newton's method (with quirks) which is something
- >>>like 20-30% faster than the equivalent hand-coded optimized assembly
- >>>version of the routine you posted in another article. It all depends on
- >>>how good your initial guess for Newton's method is.
- >>
- >>You've got me mixed up with somebody else. I did say you can do square
- >>root much faster, and explained how, but I didn't post any assembly
- >>version. I posted the C version (which therefore has the advantage
- >>that it can be compiled native for PowerPC, or ported easily to another
- >>machine that doesn't even have to be binary, although of course it is
- >>unlikely that shifting will be fast on a non-binary machine).
- >
- >Gosh! What if you read what I wrote above. I didn't claim your code was
- >written in assembly. I refer rather clearly to highly optimized assembly
- >code _equivalent_ to the C code you posted.
-
- I did read what you wrote. You have to admit that it isn't all that
- clear. One could, and I did, read it that you thought I had posted an
- "equivalent hand-coded optimized assembly".
-
- But as long as we're on the subject of carefully reading each other's
- notes, lets look again at what I said. I said that Newton/Raphson was
- not the fastest ALGORITHM for taking square roots, because it uses
- several division steps, and there is another algorithm (which I described)
- which is as fast as a single division. I also made it clear that in
- making that comparison, I was comparing only similar implementations
- of the two operations: my sqare root method implemented entirely in
- software is as fast and about the same size as division implemented
- entirely in software; my square root method implemented entirely in
- hardware is as fast and uses about the same number of logic gates as
- division implemented entirely in hardware.
-
- Comparing my wholly-software IMPLEMENTATION to your implementation using
- hardware divide is unfair and irrelevant. I stand by my claim. Square
- root is no harder or slower than division. And there is an algorithm
- for square root that will beat Newton/Raphson hands down whenever both
- algorithms are given similar implementations.
-
- To back up my claim, consider that ANY software-only implementation of
- Newton/Raphson would have to, as a minimum, test for a zero input (to
- avoid dividing by zero), and do at least one division. That is, it would
- have to be at least as complex as the following routine to compute a
- sort-of reciprocal:
-
- unsigned long lrecip (unsigned long n) {
- return (n == 0) ? 0 : 0xFFFFFFFF / n;
- }
-
- On a 68000, the division has to be done in software, because the 68000
- does not have a DIVU.L instruction. Examining the code generated by
- this routine shows that its worst-case running time on a 68000 is 1312
- clocks. My square-root routine takes at most 1362 clocks. Close enough?
-
-
- >Finally, the point I wanted to make was that an optimized version of
- >Newton's method very well beats an optimized version of the type of
- >algorithm your program is a version of. I'm not interested in any
- >"...b-b-b-but C is more portable..." arguments.
-
- But I DO care that C is more portable, and it is definitely relevant to
- the comparison. Our versions were not optimized equivalently, because
- you are focusing only on optimization to a particular piece of silicon
- that happens to implement division, but not square root, in hardware.
- Take our implementations unchanged, and run them on a PowerPC, and mine
- will leave yours in the dust. ("Unchanged" means I get to compile mine
- native - that's what I bought when I paid the price of being portable.
- "Unchanged" means you have to have yours emulated - that's what you
- bought when you decided to write processor-dependent code.) Or take
- our implementations unchanged and run them on an Intel processor. Watch
- mine still run at acceptable speed. Watch yours not run at all.
-
- I also care that mine meets the design specs. The original requestor
- said he wanted to compute square root "rounded to the nearest integer".
- My method rounds, and rounds correctly in every case. Yours, and all
- the other methods I've seen, truncate. I can truncate too, if that's
- what you want, but fixing yours to round takes a little work. Essentially,
- rounding involves comparing the unrounded root with the remainder from
- the last division. If you implement Newton/Raphson in C or Pascal, that
- remainder is not available to you, and you have to do at least another
- multiply and subtract to get it. If done in 68K Assembler, the remainder
- is available, so rounding won't slow you down much, but it will complicate
- your logic some, and of course there's that portability issue you don't
- want to talk about. Either way, remember that rounding may produce a
- result of 65536, so the routine has to return an unsigned long, just
- like the requestor asked for, instead of an unsigned short.
-
- >I've appended below a copy of my code and if you
- >care to compare it to your code, I'll take the time to respond to
- >any comments you might have (size, speed, what have you). Fair?
-
- Well OK, as long as you keep in mind we're comparing apples to
- oranges.
-
- Before looking at speed, there is the more important issue of
- correctness. As nearly as I can tell, you always return the correct
- value (except for not rounding). I won't quite go so far as to say the
- code is all correct, because you make some early mistakes that get
- compensated for later:
-
- o You test for a zero input by doing a BEQ.S right after moving the
- input to register A0 for safekeeping. Are you aware that moving to
- an A register does not affect the condition codes? Fortunately,
- the condition codes are still set correctly from the prior move
- of the same data from D7 to D0. However, you have a comment that
- says that move can be eliminated. If someone does, they might be
- in trouble.
-
- o Your logic to do large initial shifts is faulty. On some inputs (like
- 0x02000000) you overshift. On other inputs (in the range 512..65535)
- I think you are not shifting as far as you think you are shifting.
- (The logic around label @skip8 is muddled.) Fortunately, the Newton
- method is robust enough to compensate for a bad initial guess, and
- you still arrive at the correct answer, although not as efficiently
- as you could have.
-
- Maybe you don't care about either of these points - after all, you already
- said you didn't care about portability - but has no one ever pointed out
- to you that is much easier to write provably correct code in a high level
- language? Does the fact that your code does not work the way you think it
- does suggest anything?
-
- It's hard to estimate your average running time, because there are
- so many paths through the optimization and it's difficult to assign
- proper weights to each path, so I'll settle for comparing worst cases.
-
- o Your worst case is with the input 0x02000000, for which you have to
- do 4 divisions. You take a total of 994 clocks on a 68000, or
- 459 clocks on a 68020.
-
- o My worst case is with the input 0xFFFFFFFF. I take 1362 clocks on
- a 68000, 609 on a 68020.
-
- 1362 / 994 = 1.37. You are 37% faster than I am on a 68000.
- 609 / 459 = 1.33. You are 33% faster than I am on a 68020.
-
- Truth be told, I'm not unhappy that my portable software solution fares
- so well against your non-portable hardware solution, even on the hardware
- of your choice. Your speed advantage stems from the fact that you are
- coding in Assembler, not from any advantage of the algorithm. If you
- implemented your algorithm in C, then either your code would be calling a
- division subroutine (and your performance would go to hell), or you
- could specifically tell the compiler to generate 68020 code to get
- inline divide instructions. But then, besides no longer running on a
- 68000, you would still come out behind because the compiler would be
- compelled to generate a DIVU.L instruction instead of your DIVU.W, and
- that difference alone is enough to dissipate all your advantage. Add
- the extra multiply you need to do correct rounding and you're behind.
-
- -Ron Hunsinger
-
- +++++++++++++++++++++++++++
-
- >From christer@cs.umu.se (Christer Ericson)
- Date: Thu, 2 Jun 1994 08:17:18 GMT
- Organization: Dept. of Computing Science, Umea Univ., 901 87 Umea, Sweden
-
- In <00155867.fc@bmug.org> Ron_Hunsinger@bmug.org (Ron Hunsinger) writes:
- >[...] I also made it clear that in
- >making that comparison, I was comparing only similar implementations
- >of the two operations: my sqare root method implemented entirely in
- >software is as fast and about the same size as division implemented
- >entirely in software; my square root method implemented entirely in
- >hardware is as fast and uses about the same number of logic gates as
- >division implemented entirely in hardware.
-
- OK, I can agree with that. But apart from that... well, see below.
-
-
- >Comparing my wholly-software IMPLEMENTATION to your implementation using
- >hardware divide is unfair and irrelevant. I stand by my claim. Square
- >root is no harder or slower than division. And there is an algorithm
- >for square root that will beat Newton/Raphson hands down whenever both
- >algorithms are given similar implementations.
-
- It's neither unfair nor irrelevant. You're living in a dream world.
- In the real world (where the rest of us live) there's no such thing
- as "similar implementations". The better (which equals to faster in
- this case) implementation wins, regardless of whether it uses all
- available machine instructions or not.
-
- In the real world do you always cry, "Not fair! Not fair! You used
- hardware division and I didn't <stomp of feet>."? Hey, good luck to you!
-
-
- >But I DO care that C is more portable, and it is definitely relevant to
- >the comparison. Our versions were not optimized equivalently, because
- >you are focusing only on optimization to a particular piece of silicon
- >that happens to implement division, but not square root, in hardware.
- >Take our implementations unchanged, and run them on a PowerPC, and mine
- >will leave yours in the dust. ("Unchanged" means I get to compile mine
- >native - that's what I bought when I paid the price of being portable.
- >"Unchanged" means you have to have yours emulated - that's what you
- >bought when you decided to write processor-dependent code.) Or take
- >our implementations unchanged and run them on an Intel processor. Watch
- >mine still run at acceptable speed. Watch yours not run at all.
-
- You're being silly! I use my code as posted on a 68K machine, I use another
- (general) piece of code on another machine. Ever heard the word #ifdef?
-
-
- >I also care that mine meets the design specs. The original requestor
- >said he wanted to compute square root "rounded to the nearest integer".
- >My method rounds, and rounds correctly in every case. Yours, and all
- >the other methods I've seen, truncate. I can truncate too, if that's
- >what you want, but fixing yours to round takes a little work. Essentially,
- >rounding involves comparing the unrounded root with the remainder from
- >the last division. If you implement Newton/Raphson in C or Pascal, that
- >remainder is not available to you, and you have to do at least another
- >multiply and subtract to get it. If done in 68K Assembler, the remainder
- >is available, so rounding won't slow you down much, but it will complicate
- >your logic some, and of course there's that portability issue you don't
- >want to talk about. Either way, remember that rounding may produce a
- >result of 65536, so the routine has to return an unsigned long, just
- >like the requestor asked for, instead of an unsigned short.
-
- I didn't post my article as a reply to the original post, but as a proof
- that a carefully coded N/B-approach very well outperforms (in terms of
- execution time) the algebraic algorithm your code is a version of.
-
-
- >>[I wrote:]
- >>I've appended below a copy of my code and if you
- >>care to compare it to your code, I'll take the time to respond to
- >>any comments you might have (size, speed, what have you). Fair?
- >
- >Well OK, as long as you keep in mind we're comparing apples to
- >oranges.
-
- I think it's interesting to note that when you compared your code to
- Jim Cathey's code (which you erroneously thought was mine), then you
- thought size was important. Was that only because your code was smaller
- than Cathey's? Now that my code is smaller than yours, it seems that
- you don't think size is important no longer. Nor did you bother to
- compare average times.
-
- And this from someone who cries "fair" all the time...
-
-
- >Before looking at speed, there is the more important issue of
- >correctness. As nearly as I can tell, you always return the correct
- >value (except for not rounding). I won't quite go so far as to say the
- >code is all correct, because you make some early mistakes that get
- >compensated for later:
- >
- > o You test for a zero input by doing a BEQ.S right after moving the
- > input to register A0 for safekeeping. Are you aware that moving to
- > an A register does not affect the condition codes? Fortunately,
- > the condition codes are still set correctly from the prior move
- > of the same data from D7 to D0. However, you have a comment that
- > says that move can be eliminated. If someone does, they might be
- > in trouble.
-
- Yes this is true. That comment is really a mistake. I added it in haste
- when posting the code to comp.sys.m68k some while ago, when there were
- a discussion of fast implementations of integer square roots. The comment
- is faulty. The code is, and was, correct, however.
-
-
- > o Your logic to do large initial shifts is faulty. On some inputs (like
- > 0x02000000) you overshift. On other inputs (in the range 512..65535)
- > I think you are not shifting as far as you think you are shifting.
- > (The logic around label @skip8 is muddled.) Fortunately, the Newton
- > method is robust enough to compensate for a bad initial guess, and
- > you still arrive at the correct answer, although not as efficiently
- > as you could have.
-
- Harsh words from someone who doesn't bother to attribute code to the
- correct person. There is nothing wrong with the section of code you're
- refering to. That code is there to reduce the number of iterations in
- the section just below, and to _reduce_it_as_much_as_possible_on_average_
- _without_over-complicating_the_code_.
-
- Now, say again, is it correct or is it faulty? If you consider it to
- be faulty, I'd very much like to see you post what you consider to be
- a correct version!
-
-
- >Does the fact that your code does not work the way you think it
- >does suggest anything?
-
- While we're being sarcastic, I'd like to ask you if the fact that you
- find errors that aren't there suggest anything to you?
-
-
- >It's hard to estimate your average running time, because there are
- >so many paths through the optimization and it's difficult to assign
- >proper weights to each path, so I'll settle for comparing worst cases.
- >
- > o Your worst case is with the input 0x02000000, for which you have to
- > do 4 divisions. You take a total of 994 clocks on a 68000, or
- > 459 clocks on a 68020.
- >
- > o My worst case is with the input 0xFFFFFFFF. I take 1362 clocks on
- > a 68000, 609 on a 68020.
- >
- >1362 / 994 = 1.37. You are 37% faster than I am on a 68000.
- >609 / 459 = 1.33. You are 33% faster than I am on a 68020.
-
- Well, since you didn't bother running the code, I did. On an IIcx I got
- the following results (all numbers in ticks):
-
- R = Ron's code, as posted
- A = Optimized assembly code equivalent to Ron's code
- C = My code, as posted
-
- [0..65536) [0..500000) 500000 "random" numbers
- R 142 1088 1135
- A 118 906 944
- C 95 626 473
-
- 142 / 95 = 1.49
- 1088 / 626 = 1.74
- 1135 / 473 = 2.4
-
- These figures should be considerably larger on a 68000, if we can believe
- your figures. (Note that you could squeeze an extra 20% out of your code,
- which for what is essentially a library routine is quite large. I would use
- my code though.)
-
-
- >Truth be told, I'm not unhappy that my portable software solution fares
- >so well against your non-portable hardware solution, even on the hardware
- >of your choice.
-
- You were saying?
-
-
- Christer Ericson --- Internet: christer@cs.umu.se --- tel: +46-90-166794
- Department of Computer Science, University of Umea, S-90187 UMEA, SWEDEN
-
- ---------------------------
-
- >From baer@qiclab.scn.rain.com (Ken Baer)
- Subject: How to save alpha in PICT files?
- Date: Sat, 28 May 1994 00:24:17 GMT
- Organization: SCN Research/Qic Laboratories of Tigard, Oregon.
-
- I need to save 32-bit images as PICT files that include the upper 8 bits
- of data (the alpha buffer). My code currently saves as 32-bit PICT files,
- but the upper 8-bit of data are missing! I am using the methods described
- in Inside Mac and developer notes. The sequence I use now is as follows:
- - allocate and write 32-bit data to an offscreen GWorld
- - OpenPicture()
- - CopyBits(offgworld, offgworld) (copy the pixmap to itself)
- - ClosePicture()
-
- Is there an extra writing step to jam in the alpha buffer? Is there any
- sample code (in C) with a more modern method? The sample code I'm working
- from is pretty dated.
-
- Thanks in advance.
-
- -Ken Baer.
- baer@qiclab.scn.rain.com
- --
- \_ -Ken Baer. Programmer/Animator, Hash Inc.
- <[_] Usenet: baer@qiclab.scn.rain.com/AppleLink:KENBAER/Office: (206)750-0042
- =# \, Animation Master: Finally, 3D animation software an artist can afford!
-
- +++++++++++++++++++++++++++
-
- >From zapper2@netcom.com (Chris Athanas)
- Date: Sun, 29 May 1994 09:19:37 GMT
- Organization: NETCOM On-line Communication Services (408 261-4700 guest)
-
- Ken Baer (baer@qiclab.scn.rain.com) wrote:
- : I need to save 32-bit images as PICT files that include the upper 8 bits
- : of data (the alpha buffer). My code currently saves as 32-bit PICT files,
- : but the upper 8-bit of data are missing! I am using the methods described
- : in Inside Mac and developer notes. The sequence I use now is as follows:
- : - allocate and write 32-bit data to an offscreen GWorld
- : - OpenPicture()
- : - CopyBits(offgworld, offgworld) (copy the pixmap to itself)
- : - ClosePicture()
-
- : Is there an extra writing step to jam in the alpha buffer? Is there any
- : sample code (in C) with a more modern method? The sample code I'm working
- : from is pretty dated.
-
- Under system 6.0, you could use the upper 8 bits as an alpha channel.
- >From what I understand, that was a bug.
-
- Current PICT files only support 24-bit images. The upper 8bits are
- truncated, and the picture is stored as 24-bits per pixel.
-
- To associate an alpha channel with a PICT, you will need to do some
- custom stuff. I'm not sure what your application is, but if you are doing
- everything yuorself, you could store the normal picture as a 24-bit pict
- file (normal PICT file) and create a PICT resource in that file that
- contains the alpha channel. This way "normal" applications can use your
- data-fork PICT file normally, and you can use your alpha channel in your
- own apps.
-
- Also, if you are planning on using photoshop, photoshop has a buil-in
- filter that allows you to take a PICT resource from a file. So your users
- could still have access to your "custom" alpha channel in photoshop.
-
- Another alternative is to use the photoshop standard as your file format.
- The ps format allows for as many channels as you would like. And
- photoshop is a pretty standard application now, and is widely used.
-
- Hope this helps.
-
- Chris Athanas
-
- +++++++++++++++++++++++++++
-
- >From baer@qiclab.scn.rain.com (Ken Baer)
- Date: Mon, 30 May 1994 17:51:58 GMT
- Organization: SCN Research/Qic Laboratories of Tigard, Oregon.
-
- In article <zapper2CqK4Kq.Fn2@netcom.com> zapper2@netcom.com (Chris Athanas) writes:
- >Ken Baer (baer@qiclab.scn.rain.com) wrote:
- >: I need to save 32-bit images as PICT files that include the upper 8 bits
- >: of data (the alpha buffer). My code currently saves as 32-bit PICT files,
- >: but the upper 8-bit of data are missing! I am using the methods described
- >: in Inside Mac and developer notes. The sequence I use now is as follows:
- >: - allocate and write 32-bit data to an offscreen GWorld
- >: - OpenPicture()
- >: - CopyBits(offgworld, offgworld) (copy the pixmap to itself)
- >: - ClosePicture()
- >
- >: Is there an extra writing step to jam in the alpha buffer? Is there any
- >: sample code (in C) with a more modern method? The sample code I'm working
- >: from is pretty dated.
- >
- >Under system 6.0, you could use the upper 8 bits as an alpha channel.
- >From what I understand, that was a bug.
-
- A damn useful bug. I guess 32-bit QuickDraw was a typo too :-(.
-
- >Current PICT files only support 24-bit images. The upper 8bits are
- >truncated, and the picture is stored as 24-bits per pixel.
-
- What's the logic behind this? Or as Seinfeld would say, "Who are the
- programming wizards that came up with this one?!?"
-
- >To associate an alpha channel with a PICT, you will need to do some
- >custom stuff. I'm not sure what your application is, but if you are doing
- >everything yuorself, you could store the normal picture as a 24-bit pict
- >file (normal PICT file) and create a PICT resource in that file that
- >contains the alpha channel. This way "normal" applications can use your
- >data-fork PICT file normally, and you can use your alpha channel in your
- >own apps.
-
- The application in question is a 3D renderer which normally outputs
- QuickTime (which has no problem dealing with real 32-bit data), but will
- also put out Targa, PICS, and PICT. I want PICT files with an alpha
- channel that can be used primarily in Photoshop, and in other applications
- that support 32-bit PICT files.
-
- >Also, if you are planning on using photoshop, photoshop has a buil-in
- >filter that allows you to take a PICT resource from a file. So your users
- >could still have access to your "custom" alpha channel in photoshop.
-
- Are you saying that Photoshop uses the alpha in a resource method that you
- described above? Is this documented anywhere?
-
- >
- >Another alternative is to use the photoshop standard as your file format.
- >The ps format allows for as many channels as you would like. And
- >photoshop is a pretty standard application now, and is widely used.
-
- I'd rather stick with PICT, though it is easily my leasy favorite image
- format. It's really a shame that the Mac standard format is also the
- least portable to other systems. It's a big deal to us since our
- application also runs on Windows and SGI.
-
- >
- >Hope this helps.
-
- Yes, it does. My big question is what is everyone else doing for saving
- alpha with PICT files?
-
- >
- >Chris Athanas
-
-
- --
- \_ -Ken Baer. Programmer/Animator, Hash Inc.
- <[_] Usenet: baer@qiclab.scn.rain.com/AppleLink:KENBAER/Office: (206)750-0042
- =# \, Animation Master: Finally, 3D animation software an artist can afford!
-
- +++++++++++++++++++++++++++
-
- >From scottsquir@aol.com (ScottSquir)
- Date: 30 May 1994 18:54:02 -0400
- Organization: America Online, Inc. (1-800-827-6364)
-
- In article <1994May30.175158.28837@qiclab.scn.rain.com>,
- baer@qiclab.scn.rain.com (Ken Baer) writes:
-
- > My big question is what is everyone else doing for saving
- >alpha with PICT files?
-
- Set the cmpCount in the GWorld pixmap to 4. It defaults to 3.
-
- If you resize or do other CopyBits manipulation it may get lost.
- -scott
-
-
- +++++++++++++++++++++++++++
-
- >From Darrin.Cardani@AtlantaGA.NCR.COM (Darrin Cardani)
- Date: Tue, 31 May 1994 18:17:38 GMT
- Organization: UNITY, AT&T GIS
-
- In article <1994May30.175158.28837@qiclab.scn.rain.com> baer@qiclab.scn.rain.com (Ken Baer) writes:
-
- >Are you saying that Photoshop uses the alpha in a resource method that you
- >described above? Is this documented anywhere?
-
- You can get all the docs on Photoshop's format from archive.umich.edu in the
- mac/graphics/util directory. (Might be mac/development directory, actually, I'm
- not sure).
-
- >>
- >>Another alternative is to use the photoshop standard as your file format.
- >>The ps format allows for as many channels as you would like. And
- >>photoshop is a pretty standard application now, and is widely used.
-
- >Yes, it does. My big question is what is everyone else doing for saving
- >alpha with PICT files?
-
- I'm currently saving them as 2 PICT files. I've debated using 1 file with 2
- PICT resources in it. I think I'll end up using the Photoshop method, because
- it adds usefulness to my program.
-
- Also, someone mentioned you could have as many channels as needed with the
- photoshop format, I believe you can only have up to 16, though.
-
- >--
- > \_ -Ken Baer. Programmer/Animator, Hash Inc.
- ><[_] Usenet: baer@qiclab.scn.rain.com/AppleLink:KENBAER/Office: (206)750-0042
- > =# \, Animation Master: Finally, 3D animation software an artist can afford!
-
- Really? What it is and can a demo be downloaded anywhere?
-
- Darrin
-
- Darrin Cardani
- Opinions above are mine. Mine! Mine! Mine!
- Darrin.Cardani@AtlantaGA.NCR.COM
-
- +++++++++++++++++++++++++++
-
- >From a ()
- Date: 2 Jun 1994 01:11:24 GMT
- Organization: Apple Computer, Inc., Cupertino, California
-
- In article <1994May30.175158.28837@qiclab.scn.rain.com>,
- baer@qiclab.scn.rain.com (Ken Baer) wrote:
-
- > In article <zapper2CqK4Kq.Fn2@netcom.com> zapper2@netcom.com (Chris Athanas) writes:
- > >Ken Baer (baer@qiclab.scn.rain.com) wrote:
- > >: I need to save 32-bit images as PICT files that include the upper 8 bits
- > >: of data (the alpha buffer). My code currently saves as 32-bit PICT files,
- > >: but the upper 8-bit of data are missing! I am using the methods described
- > >: in Inside Mac and developer notes. The sequence I use now is as follows:
- > >: - allocate and write 32-bit data to an offscreen GWorld
- > >: - OpenPicture()
- > >: - CopyBits(offgworld, offgworld) (copy the pixmap to itself)
- > >: - ClosePicture()
- > >
- > >: Is there an extra writing step to jam in the alpha buffer? Is there any
- > >: sample code (in C) with a more modern method? The sample code I'm working
- > >: from is pretty dated.
- > >
- > >Under system 6.0, you could use the upper 8 bits as an alpha channel.
- > >From what I understand, that was a bug.
- >
- > A damn useful bug. I guess 32-bit QuickDraw was a typo too :-(.
- >
- > >Current PICT files only support 24-bit images. The upper 8bits are
- > >truncated, and the picture is stored as 24-bits per pixel.
- >
- > What's the logic behind this? Or as Seinfeld would say, "Who are the
- > programming wizards that came up with this one?!?"
- >
- > >To associate an alpha channel with a PICT, you will need to do some
- > >custom stuff. I'm not sure what your application is, but if you are doing
- > >everything yuorself, you could store the normal picture as a 24-bit pict
- > >file (normal PICT file) and create a PICT resource in that file that
- > >contains the alpha channel. This way "normal" applications can use your
- > >data-fork PICT file normally, and you can use your alpha channel in your
- > >own apps.
- >
- > The application in question is a 3D renderer which normally outputs
- > QuickTime (which has no problem dealing with real 32-bit data), but will
- > also put out Targa, PICS, and PICT. I want PICT files with an alpha
- > channel that can be used primarily in Photoshop, and in other applications
- > that support 32-bit PICT files.
- >
- > >Also, if you are planning on using photoshop, photoshop has a buil-in
- > >filter that allows you to take a PICT resource from a file. So your users
- > >could still have access to your "custom" alpha channel in photoshop.
- >
- > Are you saying that Photoshop uses the alpha in a resource method that you
- > described above? Is this documented anywhere?
- >
- > >
- > >Another alternative is to use the photoshop standard as your file format.
- > >The ps format allows for as many channels as you would like. And
- > >photoshop is a pretty standard application now, and is widely used.
- >
- > I'd rather stick with PICT, though it is easily my leasy favorite image
- > format. It's really a shame that the Mac standard format is also the
- > least portable to other systems. It's a big deal to us since our
- > application also runs on Windows and SGI.
- >
- > >
- > >Hope this helps.
- >
- > Yes, it does. My big question is what is everyone else doing for saving
- > alpha with PICT files?
- >
- > >
- > >Chris Athanas
- >
- >
- > --
- > \_ -Ken Baer. Programmer/Animator, Hash Inc.
- > <[_] Usenet: baer@qiclab.scn.rain.com/AppleLink:KENBAER/Office: (206)750-0042
- > =# \, Animation Master: Finally, 3D animation software an artist can afford!
-
- The technique is described in IM VI pp 17-22 through 17-23; basically the
- idea is that before calling CopyBits you need to set the packType field in
- the source pixmap to four and the cmpCount to 4 also.
-
- Hope this really helps.
-
- Guillermo
-
- ---------------------------
-
- >From wingo@apple.com (Tony Wingo)
- Subject: Open Transport & ASLM
- Date: Thu, 2 Jun 1994 18:06:30 GMT
- Organization: Apple Computer
-
- The following is a response from the Open Transport Team to the recent
- threads on OpenTransport, ASLM and DLL's in general.
-
- Comments may be sent to opentpt@applelink.apple.com
-
- =====================================================================
-
- When Open Transport was developed, ASLM was the only general option
- for dynamic linking - CFM and SOM were not anywhere in sight. In
- fact, today we still can only use CFM for dynamic linking on PowerPC.
- It's not available for 68K, and SOM is not available for either
- platform.
-
- Whether or not ASLM is used as the dynamic linking mechanism, the
- fact that C++ objects generated by different compilers have different
- formats makes it virtually impossible to export object-oriented
- interfaces that everyone can use. Yes - SOM will solve this problem,
- but that solution does not come without a performance penalty -
- parameters have to be marshalled, and solving the fragile base class
- problem requires a much more indirect method of dispatching than a
- jump through a vtable.
-
- We really wanted to be able to create an object-oriented framework
- for writing STREAMS modules, and ASLM allowed us to do that without a
- performance penalty. Now that CFM and SOM are either here or on the
- horizon, we've backed off of that goal.
-
- >From a client perspective, Open Transport supports a complete
- procedural interface using Pascal calling conventions for use in the
- widest range of development environments possible. It also has a C++
- object interface, which is usable with MPW Cfront, and with the PPCC
- Script/tools for PowerPC. We believe that the C++ interface is also
- usable from Symantec C++ for 68K, but this has not been tested. We
- have also not tested with the Metrowerks products, but are assuming
- that, like Symantec, they have a way to import MPW .o and XCOFF
- files.
-
- ASLM is currently used as the dynamically loading mechanism for both
- client code and STREAMS modules, but neither version REQUIRES using
- C++ interface.
-
- When SOM becomes available as part of the system, we will look at
- converting our C++ client interfaces to become SOM objects. The
- procedural interfaces for STREAMS modules will remain ASLM-based for
- 68K and CFM-based for PowerPC.
-
- The bottom line is: If you can't get Open Transport client files to
- link with your tools using our "C" interfaces, then either your tools
- don't support MPW-generated libraries, your tools can't deal with
- Pascal calling conventions, or the Open Transport team has a bug.
-
- We have not yet published any information on how to write a STREAM
- module for Open Transport. As soon as we do, it will tell you how to
- create both ASLM and CFM-based STREAM modules.
-
- I hope this clears up any misunderstandings on using Open Transport.
-
- ---------------------------
-
- >From kennedy@fenton.cs.udcavis.edu (Kennedy)
- Subject: Sending the Mac Screen Image
- Date: Thu, 2 Jun 1994 01:56:33 GMT
- Organization: University of California, Davis
-
-
- This might be too involved of a question for this news group
- But I was wondering if anyone had any Ideas. What I need to do is
- send screen image of a Macintosh so that it can be viewed in all
- its glory in an X window. The screen should refresh flicker free
- and should look to the viewer of the X window as is he/she was
- looking at the Mac monitor. This has to be done using TCP/IP.
- If you have any Ideas, example code, algorithms, or whether
- you think it can be done or not, let me know. Thanx.
-
- Brian Kennedy (kennedy@fenton.cs.ucdavis.edu)
-
-
- +++++++++++++++++++++++++++
-
- >From rmah@panix.com (Robert S. Mah)
- Date: Thu, 02 Jun 1994 01:22:22 -0500
- Organization: One Step Beyond
-
- kennedy@fenton.cs.udcavis.edu (Kennedy) wrote:
-
- > This might be too involved of a question for this news group
- > But I was wondering if anyone had any Ideas. What I need to do is
- > send screen image of a Macintosh so that it can be viewed in all
- > its glory in an X window. The screen should refresh flicker free
- > and should look to the viewer of the X window as is he/she was
- > looking at the Mac monitor. This has to be done using TCP/IP.
- > If you have any Ideas, example code, algorithms, or whether
- > you think it can be done or not, let me know. Thanx.
-
- Since you said "in all its glory" I take it you mean color.
-
- Let's see...640x480 minimum screen size means 300K per frame. At 24 fps,
- that's around 7MBytes per second. Too much.
-
- You'll have to send screen differences (i.e. just what changed on the
- screen) to get a decent frame rate. Keep a copy of the screen image on
- the Mac, note the changes, send it via TCP/IP. Receive it on the UNIX
- machine. Have the X Server send it to the X Client.
-
- You could also patch all of QuickDraw and send the QD command opcodes to
- the UNIX box. There, you'd have to write a QD emulator and execute the
- opcodes as they come in. This would be very hard to write, but would
- work faster than the screen difference method.
-
- Have you considered just buying it. I think there are a few products
- that do this out there.
-
- Cheers,
- Rob
- ___________________________________________________________________________
- Robert S. Mah -=- One Step Beyond -=- 212-947-6507 -=- rmah@panix.com
-
- +++++++++++++++++++++++++++
-
- >From David K}gedal <davidk@lysator.liu.se>
- Date: 2 Jun 1994 11:44:58 GMT
- Organization: Lysator Academic Computer Society
-
- In article <rmah-020694012222@rmah.dialup.access.net> Robert S. Mah,
- rmah@panix.com writes:
- >You'll have to send screen differences (i.e. just what changed on the
- >screen) to get a decent frame rate. Keep a copy of the screen image on
- >the Mac, note the changes, send it via TCP/IP. Receive it on the UNIX
- >machine. Have the X Server send it to the X Client.
-
- Oops, this is a very common confusion. It is of course the X client that
- will send it to the X server. The server is the one that controls the
- actual screen and does the drawing that the clients tells it to do.
-
- /David
-
- +++++++++++++++++++++++++++
-
- >From d88-jwa@mumrik.nada.kth.se (Jon Wdtte)
- Date: 2 Jun 1994 12:56:48 GMT
- Organization: The Royal Institute of Technology
-
- In <rmah-020694012222@rmah.dialup.access.net> rmah@panix.com (Robert S. Mah) writes:
-
- >Let's see...640x480 minimum screen size means 300K per frame. At 24 fps,
- >that's around 7MBytes per second. Too much.
-
- You only need send what's drawn, and you can compress it while you
- send.
-
- There are commercial apps that do EXACTLY this. There are two
- mechanisms you can use; patching the QuickDraw bottlenecks, or
- patching ShieldCursor().
-
- QuickDraw does all drawing through bottlenecks, so if you want
- object graphics, send the bottleneck commands across the network.
- You have to send port information as well, I guess.
-
- QuickDraw also calls ShieldCursor() for the enclosing rectangle of
- what it draws, and ShowCursor() when it's done, so if you only
- want a bitmap, use that area, and send the (packed) bitmap once
- drawing is done (ShowCursor() time) or accumulate it all into a
- region which you send at WaitNextEvent time.
- --
- -- Jon W{tte, h+@nada.kth.se, Mac Software Engineer Deluxe --
- This signature is kept shorter than 4 lines in the interests of UseNet
- S/N ratio.
-
- +++++++++++++++++++++++++++
-
- >From rmah@panix.com (Robert S. Mah)
- Date: Thu, 02 Jun 1994 11:40:51 -0500
- Organization: One Step Beyond
-
- David K}gedal <davidk@lysator.liu.se> wrote:
-
- > Robert S. Mah, rmah@panix.com writes:
- > >You'll have to send screen differences (i.e. just what changed on the
- > >screen) to get a decent frame rate. Keep a copy of the screen image on
- > >the Mac, note the changes, send it via TCP/IP. Receive it on the UNIX
- > >machine. Have the X Server send it to the X Client.
- >
- > Oops, this is a very common confusion. It is of course the X client that
- > will send it to the X server. The server is the one that controls the
- > actual screen and does the drawing that the clients tells it to do.
-
- Right. I don't do X, so I'm more than a bit hazy about this stuff, but
- after you mentioned it, I did recall reading an article about this and
- saying to myself, "That's supposed to make sense?"
-
- So, to correct myself...the X-client would ask the X-server to draw the
- stuff. Hey...couldn't you simply put the X-client on the Mac? Could
- be an interesting academic project.
-
- Cheers,
- Rob
- ___________________________________________________________________________
- Robert S. Mah -=- One Step Beyond -=- 212-947-6507 -=- rmah@panix.com
-
- ---------------------------
-
- >From jaffe@rabbit.cita.utoronto.ca (Andrew Jaffe)
- Subject: What are Universal Headers?
- Date: Thu, 19 May 1994 13:49:41 GMT
- Organization: CITA
-
-
- That's about it:
- What are the Universal Headers?
-
- AJ
- --
- - --------------------------------------------------------------------
- Andrew Jaffe jaffe@cita.utoronto.ca
- CITA, U. Toronto (416) 978-8497, 6879
- 60 Saint George St. -3921 (Fax)
- Toronto, ON M5S 1A1 CANADA
-
- +++++++++++++++++++++++++++
-
- >From dean@genmagic.com (Dean Yu)
- Date: 19 May 1994 17:20:31 GMT
- Organization: General Magic, Inc.
-
- In article <JAFFE.94May19094941@rabbit.cita.utoronto.ca>,
- jaffe@rabbit.cita.utoronto.ca (Andrew Jaffe) wrote:
- >
- > That's about it:
- > What are the Universal Headers?
- >
-
- Universal Headers are the twisted idea of a couple of system software
- engineers who have since left the company to try to avoid the proliferation
- of interface files as the Mac API was going cross platform. That is, rather
- than having a set of header files you compile with if you were doing say a
- PowerPC application, and different set you use if you were doing a 68K
- application, you would have one set that can be used for any platform that
- supplied the Mac API by flipping a couple of compile time switches.
- Like any grand unifying theory, it was very elusive, long in the coming,
- and probably a little less all-encompassing than it should have been.
-
- -- Dean Yu
- Negative Ethnic Role Model
- General Magic, Inc.
-
- +++++++++++++++++++++++++++
-
- >From Lars.Farm@nts.mh.se (Lars Farm)
- Date: Fri, 20 May 1994 09:43:17 +0100
- Organization: Mid Sweden University
-
- In article <dean-190594101355@dean_yu.genmagic.com>, dean@genmagic.com
- (Dean Yu) wrote:
-
- > In article <JAFFE.94May19094941@rabbit.cita.utoronto.ca>,
- > jaffe@rabbit.cita.utoronto.ca (Andrew Jaffe) wrote:
- > >
- > > That's about it:
- > > What are the Universal Headers?
- > >
- >
- > Universal Headers are the twisted idea of a couple of system software
- [...]
- > Like any grand unifying theory, it was very elusive, long in the coming,
- > and probably a little less all-encompassing than it should have been.
-
- Still, it is very nice to have _one_ set of Mac headers, after all there is
- only one API. Now it's a lot simpler to move code between compilers. This
- can't be bad.
-
- --
- Lars.Farm@nts.mh.se
-
- +++++++++++++++++++++++++++
-
- >From slosser@apple.com (Eric Slosser)
- Date: Sat, 28 May 1994 21:14:40 GMT
- Organization: Apple Computer, Inc.
-
- > In article <dean-190594101355@dean_yu.genmagic.com>, dean@genmagic.com
- > (Dean Yu) wrote:
- >
- > >
- > > Universal Headers are the twisted idea of a couple of system software
- > [...]
- > > Like any grand unifying theory, it was very elusive, long in the coming,
- > > and probably a little less all-encompassing than it should have been.
-
- Well, with the kind of people they had working on the project, what did you
- expect, Dean?
-
- PS. Don't forget the really novel idea of having the C, Pascal, and Asm
- headers auto-generated from a master file, rather than maintained by hand.
-
- --
- Eric Slosser / Apple Computer / slosser@apple.com
-
- +++++++++++++++++++++++++++
-
- >From jum@anubis.han.de (Jens-Uwe Mager)
- Date: Thu, 2 Jun 94 01:05:11 MET
- Organization: (none)
-
-
- In article <slosser-280594131440@mac96.kip.apple.com> (comp.sys.mac.programmer), slosser@apple.com (Eric Slosser) writes:
-
- > PS. Don't forget the really novel idea of having the C, Pascal, and Asm
- > headers auto-generated from a master file, rather than maintained by hand.
-
-
- Hmmm, how does your specification language look like?
-
- ______________________________________________________________________________
- Jens-Uwe Mager jum@anubis.han.de
- 30177 Hannover jum@helios.de
- Brahmsstr. 3 Tel.: +49 511 660238
-
- ---------------------------
-
- End of C.S.M.P. Digest
- **********************
-
-
-